PromptSleuth: Secure Prompt Defense
- PromptSleuth is a suite of methodologies, tools, and benchmarks designed to detect, analyze, and defend against LLM prompt injections and manipulations using semantic intent analysis.
- It utilizes a multi-step pipeline including task summarization, relation graph generation, and threshold-based detection to accurately flag unauthorized prompts.
- Benchmark results show superior performance with minimal latency, making PromptSleuth a practical solution for secure, risk-sensitive LLM deployments.
PromptSleuth refers to a family of methodologies, tools, and benchmarks developed to detect, analyze, and defend against prompt injection, prompt extraction, and related threats targeting LLMs and their applications. The term encompasses both semantic intent-based detection frameworks for prompt injection in open-text LLM deployments, as well as empirical investigations of prompt reverse engineering, leakage, and risk assessment in LLM-powered systems. Across these axes, PromptSleuth establishes semantic, structural, and practical criteria to differentiate benign prompts, adversarial manipulations, and recovered prompts, while achieving strong detection performance on state-of-the-art benchmarks and providing a foundation for secure deployment of LLMs.
1. Semantic Intent Invariance for Prompt Injection Detection
PromptSleuth introduces a semantic intent-based framework for detecting prompt injection by identifying when user inputs covertly request tasks outside the scope authorized by a system prompt. The primary insight is that, regardless of linguistic surface form, any adversarial prompt must semantically introduce an unauthorized intent distinct from those specified by the system owner (Wang et al., 28 Aug 2025).
The detection pipeline proceeds as follows:
- Task Summarization: Both system prompt () and user input () are mapped via a summarization function to sets of abstract task labels (each label comprising 2–5 words).
- Task-Relationship Graph: For each pair , where , , a detector LLM judges whether they are “related” or “unrelated.”
- Injection Detection: If any child task is unrelated to all , the prompt is flagged as an injection.
Formally, for task embeddings ,
for a threshold (e.g., 0.7), or using explicit LLM-based relation labels (“related”/“unrelated”).
This approach is robust to paraphrase, obfuscation, and multi-task injection, since it does not rely on surface patterns but on semantic intent.
2. Threat Model, Benchmarking, and Detection Pipeline
PromptSleuth’s threat model considers an attacker with black-box access to an LLM service, able to inject arbitrary strings into user-controlled input fields including free-text, metadata, or context windows, with the goal of coercing the model into unauthorized actions (e.g., system prompt forgery, user prompt camouflage, or complex behavioral manipulation).
To systematically assess defenses, PromptSleuth introduces PromptSleuth-Bench, a superset benchmark that extends prior datasets by:
- Incorporating context tampering, instruction wrapping (JSON/XML), and payload splitting.
- Including behavioral manipulations such as emotional appeals, reward framing, and narrative context switching.
- Supporting multi-task scenarios with interleaved benign and malicious instructions.
- Grading examples by three levels of difficulty (easy: single-task/legacy, medium: single-task/new, hard: multi-task/novel).
On this benchmark, PromptSleuth’s semantic-intent detection dramatically reduces False Negative Rate (FNR) compared to DataSentinel, SecAlign, and PromptArmor, even as attackers adapt with new paraphrased or obfuscated strategies (Wang et al., 28 Aug 2025).
| Defense | FPR / FNR (PromptSleuth-Bench) |
|---|---|
| DataSentinel | 0.0498 / 0.6669 |
| SecAlign | 0.4547 / 0.4947 |
| PromptArmor | 0.0926 / 0.0825 |
| PromptSleuth-5 | 0.0008 / 0.0007 |
3. Runtime Efficiency and Practical Deployment
PromptSleuth implements a lightweight detection pipeline requiring three LLM API calls (summarization, relation inference, and decision):
- Latency: for GPT-4.1-mini and GPT-5-mini, mean latency is under 2 s (1.78 s / 13.61 s, respectively), marginally higher (+3–9%) than PromptArmor and far below GPU-based baselines (10–20 s).
- Token cost scales linearly, with negligible incremental overhead given the low volume relative to LLM inference.
The framework is compatible with moderation pipelines in both provider and client-side LLM deployments, assuming the base detector LLM is sufficiently capable at semantic abstraction and relation reasoning. The system is particularly suited for risk-sensitive scenarios where adversarial prompt injection confers severe operational or privacy risk.
4. Comparative Analysis, Limitations, and Extensions
PromptSleuth’s semantic defense paradigm corrects for weaknesses in syntactic or prompt-pattern detectors. Legacy models (e.g., DataSentinel, SecAlign) exhibit high FNR when attackers exploit obscured or indirect instructions, due to their overreliance on surface regularities.
However, PromptSleuth can blur subtle distinctions (“book cheapest hotel” vs “book most expensive hotel”) if the summarization step produces coarse task clusters. Detection quality depends strongly on the abstraction and relation-judgment capabilities of the detector LLM. Deployment in very high-throughput or latency-constrained applications may need hybrid models that pre-filter with shallow syntactic methods before invoking semantic reasoning (Wang et al., 28 Aug 2025).
Future work includes: integrating graph-tracking across conversational or multi-agent sessions (resilient to stored/chained injections); and combining semantic graph analysis with keyword, perplexity, or LID-based detectors for comprehensive coverage.
5. Relation to Prompt Extraction and Reverse Engineering
PromptSleuth methodologies also encompass prompt extraction and “prompt stealing” scenarios, where the adversary aims to reconstruct the original system prompt based on the model’s output. Here, “PromptSleuth” denotes the two-stage attack described by Sha et al. (Sha et al., 20 Feb 2024):
- Parameter Extraction: Embedding-based classification to predict prompt type (direct, role-based, in-context), roles, and context count, using fine-tuned BERT models.
- Prompt Reconstruction: Synthesis of a reversed prompt by asking the LLM to generate a plausible input for the observed answer, then augmenting it with inferred structure (role headers or example contexts).
Empirical results show that attackers can recover both the surface form and downstream behavior of original prompts with notable fidelity (prompt similarity up to 0.832, answer similarity up to 0.768). Defenses such as prompt-level or answer-level paraphrasing reduce reconstructability but can also degrade answer quality.
6. Relevance to Prompt Safeguards and AI System Security
The PromptSleuth design philosophy of semantic task-level reasoning underpins scalable defenses against a wide spectrum of prompt-based threats. This is further evidenced by its applicability in risk-centric frameworks for promptware and indirect prompt injection in LLM-powered assistants, such as those operationalized in the TARA framework for Google Gemini (Nassi et al., 16 Aug 2025).
Best-practice recommendations emerging from PromptSleuth-centered literature include:
- Threat-model-driven scoring of risk across digital and physical endpoints.
- A/B differential behavior checks for untrusted inputs in production systems.
- Context separation, dynamic confirmation, and adversarial replay validation for robust mitigation.
By refocusing defense strategy from syntactic markers to semantic invariance, PromptSleuth establishes a technically rigorous, model-agnostic foundation for securing next-generation LLM applications against prompt injection and prompt manipulation campaigns.