LLM-Based Honeypots
- LLM-based honeypots are deception systems that generate realistic decoy artifacts and system responses to lure and analyze threat actors.
- They integrate dynamic command emulation, hybrid response pipelines, and fine-tuning combined with prompt engineering to enhance realism and scalability in cybersecurity.
- Implementations demonstrate high session accuracy and true-negative rates (TNR) while addressing challenges such as latency, state management, and response consistency.
LLM-based honeypots are deception systems designed to lure, interact with, and analyze threat actors using the generative and contextual capabilities of state-of-the-art LLMs. By exploiting the adaptability, context sensitivity, and interactive realism of LLMs, these honeypots address the deficiencies of traditional static or deterministic designs, with applications ranging from real-time shell or protocol emulation to scalable honeytoken generation, attack classification, and autonomous agent monitoring.
1. Key Principles and Design Patterns
LLM-based honeypots integrate LLMs primarily to (a) dynamically generate system responses and decoy artifacts, (b) enhance attack engagement and telemetry, (c) enable semantic and intent-level attack classification, and (d) automate otherwise labor-intensive deception engineering.
Major Design Elements
- Dynamic Command/Protocol Emulation: LLMs are prompted to generate output in the style of target systems (e.g., Linux shells (Sladić et al., 2023, Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025), LDAP servers (Jiménez-Román et al., 20 Sep 2025), or ICS protocols (Vasilatos et al., 9 May 2024)).
- Personality and Context Prompting: Session-specific personality prompts (e.g., instructing the LLM to consistently mimic a Linux terminal) and engineered session histories ensure output realism and internal coherence (Sladić et al., 2023, Otal et al., 12 Sep 2024, Malhotra, 1 Sep 2025).
- Hybrid Response Pipelines: Frequently used commands or protocol elements are answered via dictionary/cache for latency and performance, while novel or context-specific queries are forwarded to the LLM backend (Malhotra, 1 Sep 2025).
- Fine-Tuning and Prompt Engineering: Application-specific datasets and prompts tune LLM output structure and semantics to closely adhere to target system behavior and protocol grammars (Jiménez-Román et al., 20 Sep 2025, Vasilatos et al., 9 May 2024).
- Scalable Honeytoken Synthesis: Modular prompt architectures allow LLMs to generate a broad spectrum of honeytokens (e.g., fake credentials, realistic robots.txt, log files, configuration files) at scale (Reti et al., 24 Apr 2024).
Design Aspect | Example Implementation | Source
--- | --- | ---
Shell emulation | Persona-prompted LLM, GPT-3.5 | (Sladić et al., 2023)
LDAP honeypot | ASN.1/BER–JSON bridging, LoRA tuning | (Jiménez-Román et al., 20 Sep 2025)
ICS protocol | ByT5 bytes-to-bytes model, PCAP data | (Vasilatos et al., 9 May 2024)
Honeytoken gen. | Modular prompts, multi-LLM eval | (Reti et al., 24 Apr 2024)
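The hybrid response pipeline listed among the design elements above can be sketched in a few lines of Python. The cached commands, their outputs, and the `query_llm` stub are illustrative assumptions, not taken from any of the cited systems:

```python
# Hybrid response pipeline: answer frequent commands from a static cache
# for low latency, and fall back to an LLM backend for novel input.
# Cache contents and the query_llm stub are illustrative assumptions.

STATIC_RESPONSES = {
    "whoami": "root",
    "pwd": "/root",
    "uname -a": "Linux web01 5.15.0-78-generic #85-Ubuntu x86_64 GNU/Linux",
}

def query_llm(command: str, history: list[str]) -> str:
    """Placeholder for a call to a persona-prompted LLM backend."""
    return f"<LLM-generated output for {command!r}>"

def respond(command: str, history: list[str]) -> str:
    cached = STATIC_RESPONSES.get(command.strip())
    if cached is not None:               # fast path: dictionary/cache hit
        return cached
    return query_llm(command, history)   # slow path: forward to the LLM

history: list[str] = []
print(respond("whoami", history))            # served from cache
print(respond("cat /etc/passwd", history))   # forwarded to the LLM stub
```

In a deployment, the cache would be populated from the most frequent commands observed in prior sessions, so the LLM is only invoked for the long tail of novel input.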
2. Methodologies: Integration, Training, and Performance
LLM-based honeypots employ standardized methodologies spanning architectural integration, model training, and protocol- or system-specific customization.
Architecture and Workflow
- Input Listening and Parsing: Custom servers (SSH, LDAP, industrial TCP/IP) decode and preprocess incoming requests, often transforming to normalized intermediate representations (e.g., JSON).
- Orchestration and Prompt Chaining: Orchestrator modules manage protocol state, preserve session and semantic context, and compose prompts that guide the LLM.
- LLM Invocation and Output Validation: Requests are forwarded to fine-tuned LLM instances (often via HTTP API, LangChain, or direct inference), with outputs checked/refined before returning to the attacker.
- Session Logging: All interactions, including command, response, metadata, and timing information, are logged for threat intelligence and evaluation.
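The four workflow stages above can be connected in a minimal end-to-end sketch; every function name, the persona text, and the validation rule are illustrative assumptions rather than details of any cited implementation:

```python
# End-to-end honeypot workflow: parse the raw request, compose a prompt
# with session context, invoke the LLM, validate the output, and log.
# Function names, persona text, and validation rule are assumptions.
import json
import time

def parse_request(raw: bytes) -> dict:
    """Decode an incoming request into a normalized JSON-like dict."""
    return {"command": raw.decode(errors="replace").strip()}

def compose_prompt(session: list[dict], request: dict) -> str:
    persona = "You are a Linux server. Reply only with terminal output."
    context = "\n".join(f"$ {t['command']}\n{t['response']}" for t in session)
    return f"{persona}\n{context}\n$ {request['command']}\n"

def invoke_llm(prompt: str) -> str:
    return "total 0\n"   # stand-in for a fine-tuned model call

def validate(output: str) -> str:
    # Minimal sanitization: strip phrasing that breaks the persona.
    return output.replace("As an AI language model", "").rstrip("\n")

def handle(raw: bytes, session: list[dict], log: list[dict]) -> str:
    request = parse_request(raw)
    response = validate(invoke_llm(compose_prompt(session, request)))
    entry = {"t": time.time(), "command": request["command"],
             "response": response}
    session.append(entry)   # preserve context for the next prompt
    log.append(entry)       # full telemetry for threat intelligence
    return response

session, log = [], []
print(handle(b"ls -la\n", session, log))
print(json.dumps(log[0]["command"]))
```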
Model Training and Tuning
- Supervised fine-tuning is employed to align model output with protocol or filesystem structure, minimize semantic/grammatical errors, and ensure “ground truth” compliance (e.g., pairing bindRequest with bindResponse in LDAP (Jiménez-Román et al., 20 Sep 2025), matching PLC byte responses in ICS (Vasilatos et al., 9 May 2024)).
- Low-Rank Adaptation (LoRA) and QLoRA minimize the computational and memory footprint of training (found in Linux shell (Otal et al., 12 Sep 2024) and LDAP honeypots (Jiménez-Román et al., 20 Sep 2025)).
- Prompt Engineering and LoRA are leveraged to reinforce persona consistency and ensure output structure.
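The footprint reduction LoRA delivers can be illustrated with a short calculation. LoRA replaces a full weight update of shape d×k with a low-rank product B·A (B is d×r, A is r×k), cutting trainable parameters from d·k to r·(d+k). The dimensions and rank below are illustrative, not taken from the cited papers:

```python
# Parameter arithmetic behind LoRA's footprint reduction.
# Layer dimensions and rank are illustrative assumptions.

def full_update_params(d: int, k: int) -> int:
    """Trainable parameters for a full update of a d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update (B: d x r, A: r x k)."""
    return r * (d + k)

d, k, r = 4096, 4096, 8            # e.g. one attention projection, rank 8
full = full_update_params(d, k)    # 16,777,216 trainable parameters
lora = lora_params(d, k, r)        #     65,536 trainable parameters
print(f"reduction: {full / lora:.0f}x")  # prints "reduction: 256x"
```

QLoRA pushes this further by quantizing the frozen base weights, so only the small adapter matrices are kept in higher precision during training.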
Evaluation Metrics
- Syntax and Structure Pass Rates: Percentage of LLM outputs that are parseable and semantically valid (Jiménez-Román et al., 20 Sep 2025).
- Exact-match, Cosine, Jaro-Winkler, Levenshtein, BLEU-4: Output similarity against real system responses (Malhotra, 1 Sep 2025, Otal et al., 12 Sep 2024).
- Latency and Memory Overhead: Wall-time response and resource usage (crucial for real-time SSH/LDAP) (Malhotra, 1 Sep 2025).
- Attack Distinguishability: For honeytokens, metrics such as flatness and trawling attack success probability (e.g., 15.15% for LLM-generated honeywords vs. 29–32% for prior art (Reti et al., 24 Apr 2024)).
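Two of the similarity metrics above, exact match and a normalized edit-distance score, can be implemented directly; the sample strings are illustrative, not outputs from the cited honeypots:

```python
# Exact-match and normalized Levenshtein similarity between an LLM
# output and a real system response. Sample strings are illustrative.

def exact_match(a: str, b: str) -> bool:
    return a == b

def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

real = "-rw-r--r-- 1 root root 220 Jan  1 00:00 .bashrc"
generated = "-rw-r--r-- 1 root root 221 Jan  1 00:00 .bashrc"
print(exact_match(real, generated))           # False: one byte differs
print(round(similarity(real, generated), 3))  # high: one edit apart
```

Exact match is the strictest criterion; normalized edit distance credits near-misses, which matters because a single wrong byte can fail exact match while still being convincing to an attacker.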
3. Application Domains and System Types
LLM-powered honeypots span multiple system types and cybersecurity roles:
Shell-based and Protocol Honeypots
- Linux Shell Decoys: LLMs simulate interactive Linux shells, engaging attackers with context-aware responses and achieving a true-negative rate (TNR) of ≈0.90 with overall accuracy ≈0.92 (shelLM (Sladić et al., 2023); LLMHoney (Malhotra, 1 Sep 2025)).
- LDAP Emulation: LLM generates ASN.1/BER-encoded LDAP protocol responses, preserving field-level correctness and connection semantics; post-fine-tuning, weighted validity score reached ≈99% (Jiménez-Román et al., 20 Sep 2025).
- ICS/SCADA Environments: ByT5-based LLMs reproduce Modbus/S7Comm network behavior and physical process logic at both protocol and functional levels; response validity (RVA) saturates with ≤1600 samples (Vasilatos et al., 9 May 2024).
Honeytoken Generation
- Scalable, Modular Generation: LLMs, via modular prompt building blocks, produce honeywords, robots.txt, config files, logs, database entries; LLM-honeywords achieved lower trawling attackability (success ≈ 15.15%) than previous heuristics (Reti et al., 24 Apr 2024).
- Cross-Model Evaluation: Output realism and syntax vary by LLM (GPT-3.5, GPT-4, Gemini, LLaMA-2), with prompt optimality not always transferable between LLMs (Reti et al., 24 Apr 2024).
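The modular prompt-building approach described above can be sketched as reusable blocks combined per artifact type; the block texts and artifact catalogue are illustrative assumptions, not the prompts used in the cited work:

```python
# Modular honeytoken prompt construction: shared building blocks (role,
# realism constraints, leak prevention) are combined with a per-artifact
# instruction. Block texts and artifact list are illustrative assumptions.

BLOCKS = {
    "role": "You are generating deception artifacts for a honeypot.",
    "realism": "Output must be indistinguishable from a real production file.",
    "no_markers": "Never mention honeypots, decoys, or AI in the output.",
}

ARTIFACTS = {
    "robots.txt": "Produce a plausible robots.txt for a mid-size web shop.",
    "honeywords": "Produce 10 passwords resembling real user choices.",
    "log_file": "Produce 20 lines of plausible nginx access-log entries.",
}

def build_prompt(artifact: str) -> str:
    """Compose the shared blocks with one artifact-specific instruction."""
    parts = [BLOCKS["role"], BLOCKS["realism"], BLOCKS["no_markers"],
             ARTIFACTS[artifact]]
    return "\n".join(parts)

print(build_prompt("robots.txt"))
```

Because only the final instruction varies, new honeytoken types can be added without touching the shared realism constraints, which is what makes the approach scale.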
LLM-Agent and Adversarial Monitoring
- Agent Detection: Honeypots detect and distinguish LLM-based hacking agents by strategically embedding prompt injections, crafted to hijack agent goals or exfiltrate prompt contents, combined with response-time analysis; response times ≤1.5 s are indicative of automation (Reworr et al., 17 Oct 2024).
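The timing side of this detection can be sketched as a simple heuristic. The ≤1.5 s threshold follows the cited finding; the classifier structure, median statistic, and sample timings are illustrative assumptions:

```python
# Timing-based heuristic for distinguishing LLM agents from human
# attackers: agents respond quickly and uniformly, humans slowly and
# variably. Threshold follows the cited ~1.5 s figure; the rest is
# an illustrative assumption.

AGENT_THRESHOLD_S = 1.5

def classify_session(response_times_s: list[float]) -> str:
    """Label a session from its inter-command response times (seconds)."""
    if not response_times_s:
        return "unknown"
    median = sorted(response_times_s)[len(response_times_s) // 2]
    return "likely-agent" if median <= AGENT_THRESHOLD_S else "likely-human"

print(classify_session([0.8, 1.1, 0.9, 1.3]))    # fast, uniform
print(classify_session([4.2, 7.9, 3.5, 11.0]))   # slow, variable
```

In practice this signal is combined with the prompt-injection probes: a session that both answers fast and obeys an embedded injected instruction is a strong agent indicator.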
4. Impact, Benefits, and Comparative Advantages
LLM-based honeypots are characterized by improvements in realism, adaptability, and intelligence collection, with measurable benefits over traditional designs.
Enhanced Realism and Attacker Engagement
- LLMs produce context-sensitive, previously unseen file structures and system behaviors, increasing attacker dwell time and hindering simple detection (Sladić et al., 2023, Malhotra, 1 Sep 2025, Jiménez-Román et al., 20 Sep 2025).
- Experiments revealed TNR ≈ 0.90 and overall session accuracy ≈ 0.92, confirming human attackers struggled to distinguish LLM outputs from real Linux shell responses (Sladić et al., 2023).
Scalability and Versatility
- Modular prompt architectures support dynamic generation of diverse deception artifacts without retraining or dictionary expansion (Reti et al., 24 Apr 2024).
- In LDAP and ICS honeypots, LLM automation significantly reduced manual protocol scripting, with fine-tuning ensuring field and context correctness (syntax pass up to 100%) (Jiménez-Román et al., 20 Sep 2025, Vasilatos et al., 9 May 2024).
Incident Analysis and Threat Intelligence
- LLMs can both synthesize attacker narratives and classify interaction types (automated vs. human-driven), providing contextual severity scores and incident summaries (Chacon et al., 2020).
- Dynamic, longer dialogues enable exposure and analysis of advanced tactics, techniques, and procedures (TTPs) as well as AI-agent attack behaviors (Reworr et al., 17 Oct 2024).
Comparative Limitations and Trade-offs
- LLM-based honeypots incur non-trivial cost and latency; e.g., LLMHoney’s Gemini-2.0 backend averages ≈3 s per response, and cloud-based deployments estimate ≈US$0.8 per active hour (Sladić et al., 2023, Malhotra, 1 Sep 2025).
- Occasional hallucinated or inconsistent outputs persist, especially in small LLMs or without careful prompt state management; fallback to cached/dictionary responses and sanitization checks are used as mitigations (Malhotra, 1 Sep 2025).
5. Challenges and Practical Constraints
Deployment and operationalization of LLM-based honeypots present a distinct set of technical and resource challenges:
- Latency and Compute Overhead: Achieving low latency (≤3 s) and resource efficiency while maintaining interaction fidelity; models larger than ~3B parameters require more memory and hardware acceleration (Malhotra, 1 Sep 2025).
- State and Consistency Management: Preventing contradictory or out-of-character responses across extended attacker sessions (addressed by session context saving, prompt updating, and local state management) (Sladić et al., 2023, Malhotra, 1 Sep 2025).
- Scalability of Realistic Artifacts: Prompt engineering must be robust to prompt injection and brittle generalization across LLMs (Reti et al., 24 Apr 2024, Reworr et al., 17 Oct 2024).
- Interpretability and Robustness: LLM-based classification/detection outputs often lack direct interpretability; adversarial adaptation and prompt “gaming” are anticipated, necessitating ongoing retraining and defensive hardening (Chacon et al., 2020, Reworr et al., 17 Oct 2024).
- Security of the Deception Environment: Robust containment is mandatory; the LLM must never execute attacker-supplied commands, and protocol emulation must avoid accidental exposure of sensitive backends (Malhotra, 1 Sep 2025, Jiménez-Román et al., 20 Sep 2025).
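The state-and-consistency challenge above is commonly handled with a rolling session history: early, persona-defining facts are pinned so the LLM cannot contradict them, while older turns are dropped to fit a context budget. The budget, pinning policy, and sample data below are illustrative assumptions:

```python
# Session-state management sketch: keep pinned persona facts plus as
# many recent turns as fit a context budget, so prompts stay bounded
# while early facts (hostname, users) remain visible to the LLM.
# Budget and pinning policy are illustrative assumptions.

MAX_CONTEXT_CHARS = 2000   # stand-in for a token budget

def trim_history(pinned: list[str], history: list[str]) -> list[str]:
    """Keep pinned facts, then as many recent turns as fit the budget."""
    kept: list[str] = []
    used = sum(len(p) for p in pinned)
    for turn in reversed(history):            # walk newest-first
        if used + len(turn) > MAX_CONTEXT_CHARS:
            break
        kept.append(turn)
        used += len(turn)
    return pinned + list(reversed(kept))      # restore chronological order

pinned = ["hostname=web01", "user=root"]
history = [f"$ cmd{i}\noutput{i}" for i in range(200)]
context = trim_history(pinned, history)
print(len(context), context[0], context[-1])
```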
6. Future Directions and Open Research Problems
LLM-based honeypot deployment and sophistication are expected to evolve alongside both attacker capabilities and LLM technology:
- Long-Term State Modeling: Expansion to large context models (≥16k–32k tokens) or external memory mechanisms to capture protracted attacker engagement (Malhotra, 1 Sep 2025).
- Automated Output Validation: Integration of secondary discriminators (regex, rule-based, or additional LLMs) to automatically detect and flag hallucinations or incoherence (Malhotra, 1 Sep 2025).
- Dynamic Signal/Noise Balancing in Prompting: Auto-tuning prompt elements using live evaluation metrics or discriminator feedback loops (Reti et al., 24 Apr 2024).
- Broader Protocol and Application Coverage: Generalization to further protocols (e.g., SMB, RDP, HTTP/2), multitasking LLM-backed honeypots, and industrial/specialized OT/ICS contexts (Vasilatos et al., 9 May 2024, Jiménez-Román et al., 20 Sep 2025).
- Adversarial/Agent Behavioral Analysis: Advanced detection and classification of autonomous LLM hacking agents and integration of public dashboards for live threat monitoring (Reworr et al., 17 Oct 2024).
- Resource and Response Optimization: Model quantization, hardware acceleration, and hybrid LLM/dictionary designs to improve response time for large-scale deployments (Malhotra, 1 Sep 2025, Jiménez-Román et al., 20 Sep 2025).
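The automated output validation direction above can be approximated today with a rule-based secondary discriminator that flags persona-breaking output before it reaches the attacker; the specific patterns are illustrative assumptions, not rules from the cited systems:

```python
# Rule-based secondary discriminator: flag LLM outputs that would break
# the terminal persona. Patterns are illustrative assumptions.
import re

BREAKING_PATTERNS = [
    re.compile(r"as an ai (language )?model", re.IGNORECASE),
    re.compile(r"i (cannot|can't) help", re.IGNORECASE),
    re.compile(r"```"),   # markdown fences would leak the LLM backend
]

def flag_output(output: str) -> list[str]:
    """Return the patterns an output violates (empty list = passes)."""
    return [p.pattern for p in BREAKING_PATTERNS if p.search(output)]

print(flag_output("total 4\ndrwxr-xr-x 2 root root 4096 .ssh"))  # passes
print(flag_output("As an AI model, I cannot run commands."))     # flagged
```

A flagged output would be discarded in favor of a cached fallback or regenerated, mirroring the fallback mitigations already used in current systems; an LLM-based discriminator could later replace or augment the regex rules.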
7. Comparative Evaluation and Practical Implications
Experimental results across representative LLM-powered honeypots indicate superior deception fidelity and threat intelligence compared to rule-based or static decoy systems.
Metric/Aspect | LLM Honeypots (SSH, LDAP) | Traditional Systems
--- | --- | ---
Realistic Output | TNR ≈0.90 for shells (Sladić et al., 2023, Malhotra, 1 Sep 2025); ≈99% weighted validity for LDAP (Jiménez-Román et al., 20 Sep 2025) | Varies, often <80%
Session Consistency | Multi-turn, context-dependent (Sladić et al., 2023) | Typically stateless or FSM-based
Attack Engagement | Prolonged, genuine interactions (Sladić et al., 2023) | Short/lower attacker dwell time
Honeytoken Diversity | 7 types, auto-generated, low attacker success | Limited, manual or coarse
A plausible implication is that, as LLM cost, latency, and integration maturity improve, these deception techniques will become foundational in early warning, adaptive defense, and threat hunting workflows across diverse cyber operations. As the capabilities and accessibility of LLMs expand, so will their utility and importance within the deception and active defense paradigm.