Papers
Topics
Authors
Recent
2000 character limit reached

DefensiveTokens: Adversarial Prompt Protection

Updated 18 December 2025
  • DefensiveTokens are defined as prompt constructs designed to protect LLMs from unauthorized manipulations using geometric, semantic, and internal representation features.
  • The approach integrates methods like CurvaLID, PromptSleuth, and PIShield, each employing specialized techniques such as curvature analytics, intent abstraction, and layer-specific token extraction.
  • Empirical results demonstrate that these multi-level defenses significantly reduce false positives/negatives, ensuring robust resilience against sophisticated adversarial prompts.

DefensiveTokens are prompt constructs and mechanisms—both implemented and conceptualized—that protect LLMs, agent-based systems, and generative pipelines from unauthorized manipulations, information exfiltration, prompt leakage, structural misinterpretation, or task injection. These tokens serve as integral components for evaluation, detection, mitigation, and recovery in response to evolving adversarial prompt threats. They are foundational to a broad family of methods that subsume geometric, intent-based, internal-representation, and empirical prompt-evaluation strategies, all of which converge on the goal of robust and safe large-model deployment.

1. Theoretical Basis for DefensiveTokens

The concept of DefensiveTokens emerges from the intersection of adversarial robustness and practical prompt engineering for LLM systems. DefensiveTokens encapsulate features, token-level constructs, and derived signatures within prompts or system-level wrappers that either detect, block, or mitigate adversarial manipulation. This framing includes:

  • Geometric attributes: Local curvature and manifold properties of the word-embedding trajectory in the prompt space, as applied in CurvaLID, where adversarial prompts are revealed via deviations in curvature and local intrinsic dimensionality (LID) of their embedding path (Yung et al., 5 Mar 2025).
  • Intent and task abstraction: Explicit mapping from prompt tokens to a discrete task space, enabling semantic invariance checks to distinguish benign from injected tasks, as exemplified by the intent abstraction function I():{prompt strings}TI(\cdot): \{\text{prompt strings}\} \to \mathcal{T} used in PromptSleuth to construct robust task-relationship graphs (Wang et al., 28 Aug 2025).
  • Internal representation features: Extraction of layer-specific residual streams or projected token vectors that are discriminative for contaminated versus clean prompts, as in PIShield’s injection-critical layer approach (Zou et al., 15 Oct 2025).

The DefensiveToken paradigm fundamentally recognizes that traditional surface-form or keyword-based filtering is inadequate against sophisticated paraphrases or obfuscations. Instead, robust defenses require abstraction and featureization at the token or representation level, facilitating more generalizable and efficient detection.

2. Key Defensive Mechanisms and Architectures

Defensive tokens are operationalized via several architectural strategies:

a) Geometric Defense: CurvaLID

CurvaLID employs geometric token analytics, extending the Whewell equation to quantify prompt-curvature in high-dimensional word-embedding space. It leverages local curvature and LID to form classifier inputs that differentiate benign prompts (low curvature, stable LID) from adversarial ones (high curvature, elevated LID), providing a model-agnostic prompt filter (Yung et al., 5 Mar 2025).

b) Semantic Intent Reasoning: PromptSleuth

PromptSleuth abstracts input tokens to concise tasks (2–5 words each) and constructs a bipartite graph via summarizing both system (S) and user (U) prompts. DefensiveTokens, in this formulation, are the summarized task-label tokens, and the relationships (edges) between them are the mechanisms for invariance checks. Injected tasks manifest as isolated or unrelated subgraphs, flagging prompt injection irrespective of surface obfuscations (Wang et al., 28 Aug 2025).

c) Internal Feature Extraction: PIShield

PIShield identifies an injection-critical transformer layer L*, extracting the residual stream token representation for the final prompt token. A defensive token, in this context, is the hidden state vector htLh_t^{L*}, which is classified as clean or contaminated using a logistic regression model. This discriminative approach is highly effective even against adaptive white-box attacks (Zou et al., 15 Oct 2025).

d) Structural and Statistical Filters

Detection modules for multimodal prompt-agent systems—such as those for agent tool misuse (Imprompter)—include DefensiveTokens built on perplexity-based thresholding, token mask anomalies (e.g., high non-English token frequency), and statistical features (length, entropy, punctuation ratio), all of which combine to flag adversarial prompt constructions (Fu et al., 19 Oct 2024).

3. Detection Algorithms and Metrics

The implementation of DefensiveTokens underpins several distinct detection procedures:

Defense Framework Core DefensiveToken Detection Technique FPR/FNR (Best Case)
CurvaLID Curvature, LID features Geometric classifier Not specified
PromptSleuth Summarized task tokens Tuple/graph clustering on task relations FPR ≈ 0.001, FNR ≈ 0
PIShield Layer-specific token vector Logistic reg. on internal feature FPR ≈ 0.4%, FNR ≈ 0
Imprompter Defenses High-PPL, rare token features Statistical/lexical/syntactic anomaly check Detection ≥90% (varies)

4. Empirical Efficacy and Benchmarking

Experimental validation across new and legacy benchmarks demonstrates that DefensiveToken-enhanced systems surpass classical and surface-based methods, especially under paraphrased, obfuscated, and multi-task adversarial scenarios:

  • PromptSleuth-Bench: Demonstrates collapse of surface-based defenses (e.g., DataSentinel FNR ≫ 0.4 on “hard” tiers), while intent-abstraction methods achieve near-zero miss rates (Wang et al., 28 Aug 2025).
  • PIShield: Maintains FPR < 0.4% and FNR ≈ 0% across five datasets and eight attack types, including strong adaptive white-box attacks where other finetuning or PPL-based baselines fail (Zou et al., 15 Oct 2025).
  • CurvaLID: Reveals fundamental geometric separability of adversarial vs. benign prompts, delivering model-agnostic detection that is not tied to LLM internals (Yung et al., 5 Mar 2025).
  • Imprompter-style adversarial prompts: Feature-based detectors analyzing perplexity and token rarity are effective against both information exfiltration and API misuse (Fu et al., 19 Oct 2024).

5. Defense, Recovery, and Forward-Looking Mitigations

Best-practice deployments of DefensiveTokens for real-world applications combine multiple detection and mitigation classes:

  • Pre-execution filtering: Block or flag prompts exceeding high perplexity, rare token ratios, or manifesting unusual tool-call tokens, filtering at both token and representation levels (Fu et al., 19 Oct 2024).
  • Prompt mutation and copy-path disruption: Defenses such as repeated prefixing or fake prompt insertion disrupt transformer attention paths that otherwise facilitate secret-prompt leakage, dramatically reducing “uncovered rates” (UR) by up to 83.8% for Llama2-7B and 71.0% for GPT-3.5 under implicit-extraction attacks (Liang et al., 5 Aug 2024).
  • Composite detection: Integrating semantic-intent invariance checks, geometric feature classifiers, and internal representation detectors yields layered defense—semantic for generalizability, geometric for prompt-space “outliers,” and representation-level for runtime efficiency and robustness.
  • System-integrated detection: PIShield and PromptSleuth architectures wrap their detection logic in protected system messages, preventing attackers from modifying detection code, and are agnostic to user-controlled prompt content (Wang et al., 28 Aug 2025, Zou et al., 15 Oct 2025).

6. Challenges, Limitations, and Research Trajectories

DefensiveTokens frameworks, while effective, exhibit several limitations and open challenges:

  • Model dependence and portability: Methods using internal features (e.g., PIShield’s injection-critical layer) require extraction access, which may not be feasible on black-box APIs, though geometric (CurvaLID) and semantic (PromptSleuth) approaches are more model-agnostic (Yung et al., 5 Mar 2025, Wang et al., 28 Aug 2025).
  • Boundary ambiguity: Detection quality depends on precise system prompt definition—vague or ambiguous policies worsen FPR/FNR (Wang et al., 28 Aug 2025).
  • Semantic adjacency challenge: Adversarial subtasks that are semantically adjacent to allowed tasks may evade even sophisticated intent detectors.
  • Stability of prompt modifications: Perplexity-increasing randomizations (random insertions) can unintentionally enhance memorization, highlighting non-linearity in defense effects (Liang et al., 5 Aug 2024).

Suggested directions include memory-augmented prompt-evolution strategies for robust prompt generation, cross-layer anomaly aggregation, and hybrid integration with anomaly-detection on syntax or token statistics.


DefensiveTokens represent a comprehensive, multi-level technical approach for safeguarding LLMs and agent-based frameworks from prompt-based exploitation. By transitioning from surface cues to geometric, semantic, and internal representation analyses, DefensiveTokens form the foundation for state-of-the-art adversarial prompt detection, mitigation, and recovery systems in contemporary and evolving LLM deployments (Yung et al., 5 Mar 2025, Wang et al., 28 Aug 2025, Zou et al., 15 Oct 2025, Fu et al., 19 Oct 2024, Liang et al., 5 Aug 2024).

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to DefensiveTokens.