LLM-Based Honeypot: Adaptive Deception

Updated 27 September 2025

LLM-based honeypots are deception systems that use large language models to generate dynamic, context-aware responses across various protocols.
They integrate protocol proxies, session management, and adaptive response generation to emulate interactive environments such as SSH, ICS, and LDAP.
These systems enhance threat intelligence by leveraging quantitative evaluation, dynamic honeytoken generation, and robust behavioral mimicry.

An LLM-based honeypot is a deception-oriented system that employs LLMs to generate realistic, context-aware responses for cyber defenders to lure, observe, and analyze attacker interactions. These systems represent a marked advancement over traditional static honeypots by offering high-fidelity emulation, adaptive countermeasures, and richer intelligence collection across a broad range of cyber contexts, including conventional shell environments, network protocols, industrial control, and even orchestrated behavioral decoys.

1. Principles and Architecture of LLM-based Honeypots

LLM-based honeypots combine generative LLMs (such as GPT-3.5, GPT-4, Llama3, Phi3, Gemini) with conventional honeypot frameworks to produce convincing, dynamic responses in real time. A typical architectural stack comprises several interconnected modules:

Protocol/Service Proxy: Listens for incoming connections and translates low-level requests (e.g., SSH, LDAP, Modbus, OPC UA) into structured representations suitable for LLM processing.
State/Session Management: Maintains session context, emulated filesystem state, and command or transaction history for continuity and realism.
LLM Backend: A fine-tuned or prompt-engineered LLM generates protocol-compliant, contextually correct outputs, leveraging hybrid strategies (e.g., dictionary lookups for common commands plus LLM for novel queries) for both latency and fidelity.
Logging and Analysis: Records all interaction details, including timestamped requests and responses, system state changes, and attacker metadata for forensic and intelligence purposes.
System Integration: Adapts legacy or production honeypot stacks (such as Cowrie, AsyncSSH, or custom protocol servers) to interoperate with the LLM-based backend (Sladić et al., 2023, Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025, Jiménez-Román et al., 20 Sep 2025).

These honeypots often operate alongside or within infrastructures leveraging Elastic Stack (Filebeat, Logstash, Elasticsearch, Kibana) for full-spectrum telemetry and visualization (Chacon et al., 2020).

2. Dynamic Response Generation and Realism

The central innovation of LLM-based honeypots is their ability to craft dynamic, context-aware, and high-fidelity emulated responses, raising the barrier for attacker detection and inducing prolonged engagement.

Dynamic Shell Simulation

Systems such as shelLM and LLMHoney employ prompt-engineering (“personality prompts”) and session context concatenation to yield outputs that mimic a real Linux terminal, including nuanced errors and system state transitions. The session history is curated to stay within token limits, and CoT prompting is used for intermediate reasoning (Sladić et al., 2023, Malhotra, 1 Sep 2025).
In one evaluation, shelLM achieved a True Negative Rate (TNR) of 0.90 and overall accuracy of 92%, denoting that outputs were recognized as realistic in 90% of cases by expert reviewers (Sladić et al., 2023).

Protocol Emulation

LLMPot leverages byte-level LLMs (ByT5) trained on real PLC request–response data (captured and encoded in hexadecimal) to simulate industrial protocols (Modbus, S7Comm) at the network level (Vasilatos et al., 9 May 2024).
In the LDAP context, an orchestrator parses ASN.1/BER-encoded requests to JSON, sends to an LLM (LLaMA 3.1 fine-tuned), and reconstructs responses, allowing highly adaptive behavior and context-aware fields (e.g., language, organizational metadata) (Jiménez-Román et al., 20 Sep 2025).

Honeytoken Generation

LLMs are systematically tasked with generating honeytokens of various types—including plausible robots.txt entries, honeywords, log files, and configuration data. A modular prompt architecture (Generator Instructions, User Input, Special Instructions, Output Format) enables the large-scale synthesis of decoys that closely mimic real artifacts, with quantitative metrics showing LLM-generated honeywords are less distinguishable from real passwords than those created with traditional methods (detection rates down to ~14%) (Reti et al., 24 Apr 2024).

3. Evaluation Metrics and Performance Considerations

Rigorous evaluation combines both subjective expert reviews and objective metrics:

Textual Similarity: Cosine Similarity, Jaro-Winkler, Levenshtein Distance, BLEU score are used to compare generated outputs with ground-truth shell responses (Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025).
Accuracy Metrics: Exact match, True/False Positive/Negative Rates, confusion matrices (Sladić et al., 2023).
Protocol Correctness: For LDAP, a Weighted Validity Score combines factors: Syntax, Structure, KeyFields, Completeness ( $Weighted = 0.4\cdot Syntax + 0.3\cdot Structure + 0.2\cdot KeyFields + 0.1\cdot Completeness$ for search ops) (Jiménez-Román et al., 20 Sep 2025).
Response Latency: Measured end-to-end, with values ranging from ~180 ms (LLMPot for ICS) to 3–3.6 s for LLMHoney (Gemini-2.0, Qwen2.5, or Phi3 models) (Vasilatos et al., 9 May 2024, Malhotra, 1 Sep 2025).
Memory Overhead: Calculated on a per-command basis to ensure deployment feasibility on resource-constrained hosts (Malhotra, 1 Sep 2025).
Hallucination Rate: Proportion of model outputs inconsistent with known system states or protocol progress (mean 5–13% depending on model and configuration; Gemini-2.0 ~13%, Phi3 ~6%) (Malhotra, 1 Sep 2025).

Hybrid architectures (combining LLM responses for novel commands and dictionary-based responses for common commands) optimize this trade-off between latency, realism, and resource cost.

4. Adaptive Deception, Behavioral Mimicry, and Personality Modelling

LLM-based honeypots enable novel forms of adaptive deception and agent-based mimicry not possible with conventional systems.

Adaptive Honeytokens: LLMs can generate contextually shifting decoys (e.g., fake credentials or code comments) that evolve in real time, responding to attacker probes or reconnaissance patterns and defeating static blacklist scans (Chacon et al., 2020, Reti et al., 24 Apr 2024).
Intent/Source Attribution: By analyzing the nature and sequence of interactions with honeytokens (e.g., correct use of embedded credentials, minor edits to decoy data), these systems can discriminate automated tools from human operators, raising incident severity when manual behaviors are identified (Chacon et al., 2020).
Personality-driven Agents: SANDMAN architecture shows that LLM persona induction—mapping the five-factor OCEAN model onto agent behavior via prompt engineering—statistically alters task schedules and decision patterns. This supports high-fidelity behavioral deception, confounding attackers monitoring for synthetic activity by simulating natural human variations in schedule and priorities (Newsham et al., 25 Mar 2025).
Agent Detection: Honeypots augmented for agent detection employ prompt injection (“goal hijacking” and “prompt stealing”) plus response-timing analysis to identify LLM agent-driven attacks. Sessions responding to injected challenges within 1.5 seconds and susceptible to prompt manipulation are flagged as potential AI agents (Reworr et al., 17 Oct 2024).

5. Application Domains and Deployment Strategies

LLM-based honeypots have been evaluated in multiple settings:

SSH and Shell Environments: Systems such as shelLM, LLMHoney, and LLM Honeypot have demonstrated robust capacity to emulate interactive Bash prompts over SSH/Telnet, engaging both brute-force malware and sophisticated manual attackers (Sladić et al., 2023, Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025).
Industrial Control Systems: LLMPot automatically generates high-interaction decoys for ICS protocols through fine-tuned, vendor-agnostic LLMs, enabling adaptation to arbitrary PLCs without manual protocol emulation (Vasilatos et al., 9 May 2024).
Directory Services: Intelligent LDAP honeypots generate dynamic, context-sensitive responses, leveraging fine-tuned LLaMA models to model complex directory schema (e.g., organization-specific DNs, attributes) and supporting adaptive interaction volumes by intelligent token allocation (Jiménez-Román et al., 20 Sep 2025).

The following table summarizes deployment types and LLM integration:

Application Area	LLM Role	Integration Point
SSH/Bash Simulation	Shell output generation	Interactive shell cmds
ICS/PLC Emulation	Protocol request/response	Network protocol R/R
LDAP Directory	Attribute/value simulation	Search/modify ops
Honeytoken Creation	Decoy data generation	Planted credentials, logs

6. Challenges, Limitations, and Future Directions

LLM-based honeypots present significant advancements but also introduce notable challenges:

Integration Complexity: Tight coupling between conventional honeypot infrastructure and LLM inference engines (especially for real-time sessions) poses engineering challenges—particularly around session management, prompt tracking, and low-latency parallelization (Chacon et al., 2020, Malhotra, 1 Sep 2025).
Hallucination and Consistency: LLMs occasionally output syntactically correct but semantically inconsistent responses (hallucinations). Virtual file system state tracking, session logs, and prompt context management are critical mitigation strategies, but the issue is not entirely eliminated with current generation models (Malhotra, 1 Sep 2025, Sladić et al., 2023, Otal et al., 12 Sep 2024).
Scalability and Resource Usage: Despite advances in quantization (e.g., QLoRA/LoRA), running LLMs—particularly in live, interactive honeypots—remains resource-intensive relative to traditional stateless deception tools (Otal et al., 12 Sep 2024, Jiménez-Román et al., 20 Sep 2025).
Potential for Detection and Adversarial Probing: As LLM-based deception proliferates, sophisticated attackers may develop fingerprinting or adversarial prompt-injection strategies to expose honeypot behavior (Reworr et al., 17 Oct 2024). A plausible implication is that an arms race between model-based defenders and attackers employing LLMs for offense as well as reconnaissance will accelerate.
Ongoing Research: Future efforts target finer-grained behavioral simulation, integrating adaptive decision cycles, multi-agent coordination, and feedback-driven prompt/context auto-tuning. Progress in reducing hallucination, improving latency, and dynamic context management (e.g., with enhanced memory architectures) remains a focus (Newsham et al., 25 Mar 2025, Vasilatos et al., 9 May 2024, Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025).

7. Comparative Context and Evolution

An LLM-based honeypot fundamentally differs from both traditional low-interaction decoys (static responses, simple event logging) and script-driven dynamic systems by combining natural language understanding, protocol adaptation, and behavioral mimicry within a single framework. Compared to LSTM-based generative honeypots for industrial process replication—which excel at continuous signal modeling but not interactive command handling—LLM-based systems provide broader protocol and semantic flexibility, spanning from protocol conversation to complex user interaction (Sassnick et al., 28 Oct 2024, Vasilatos et al., 9 May 2024, Jiménez-Román et al., 20 Sep 2025).

The progressive integration of LLMs into honeypot design reflects a paradigm shift toward adaptive, high-fidelity, and behaviorally flexible deception platforms that simultaneously facilitate enhanced threat intelligence, improved attack attribution, and the potential for adversarial agent detection via interactive and analytic techniques.