LLM-Based Honeypots

Updated 19 November 2025

LLM-based honeypots are advanced deception systems that simulate high-fidelity interactive services to engage and analyze malicious activity in real-time.
They integrate multilayered pipelines—including input proxies, session managers, and fine-tuned LLM response generators—to emulate protocols like SSH, database, and industrial systems.
Evaluation metrics demonstrate higher attacker engagement and improved threat intelligence, establishing their effectiveness over traditional honeypots.

LLM-based honeypots are advanced, deception-focused security platforms that leverage LLMs to emulate high-fidelity interactive systems for the purpose of detecting, engaging, and analyzing malicious activity. By fine-tuning LLMs on diverse real-world attacker interactions and pairing them with engineering techniques for system emulation, these honeypots offer dynamic, contextually appropriate responses that closely resemble authentic service behaviors. The result is a significant step beyond static, easily fingerprinted decoys, providing both deeper attacker engagement and richer threat intelligence across a range of protocols and environments (Otal et al., 2024, Adebimpe et al., 24 Oct 2025, Sladić et al., 8 Oct 2025, Malhotra, 1 Sep 2025, Bridges et al., 29 Oct 2025).

1. Design Methodologies and Architectural Patterns

LLM-based honeypots typically follow multilayered pipelines organized around attacker-facing proxies, interaction managers, prompt engineering logic, LLM-driven response engines, and comprehensive logging subsystems. A canonical architecture comprises:

Input Proxy: Handles protocol-level details (SSH, HTTP, database, ICS) and normalizes incoming requests.
Command Routing & Filters: Delegates simple requests to deterministic responders or caches, routing complex/novel queries through the LLM with context and session state.
Session Manager: Maintains per-session state, histories, and simulates file systems or protocol metadata (using sliding window or importance-weighted pruning to fit model context).
Prompt Engineering Pipeline: Constructs system prompts embedding OS/service persona, behavioral rules (e.g., "never admit you are synthetic"), and detailed response instructions, potentially augmented by retrieval-augmented generation (RAG) modules for ground-truth outputs or explanations (Adebimpe et al., 24 Oct 2025).
LLM Response Generation: Uses locally hosted or cloud-based LLMs, frequently fine-tuned via LoRA or full SFT, for output synthesis.
Post-processing & Telemetry: Ensures output fidelity, sanitizes hallucinations, logs all exchanges, and supports analytic pipelines (Bridges et al., 29 Oct 2025, Otal et al., 2024, Sladić et al., 8 Oct 2025).

Representative frameworks include SBASH (focus: RAG vs. prompt-tuned LLM tradeoffs), VelLMes (multi-protocol, multi-service orchestration), and LLMPot (byte-level LLMs for industrial protocols). Table 1 summarizes principal architecture choices:

Framework	Model Size	Protocols	Tuning Method	Storage/State
SBASH	4B/8B/12B	SSH shells	Prompt/RAG	In-memory FS
VelLMes	GPT-3.5/4	SSH/MySQL/POP3	Fine-tuned/Prompt	Per-IP/session
LLMPot	ByT5-small	Modbus/DNP3	Fine-tuned	Context+Logging
LDAP Honeypot	LLaMA 8B	LDAP	Fine-tuned/LoRA	JSON logging

2. Data Collection, Preparation, and Model Training

The realism of LLM-based honeypots is underpinned by comprehensive datasets capturing authentic attacker behavior across a range of protocols and use cases. Data sources include:

Honeypot logs from Cowrie, OpenLDAP, or industrial systems (datasets typically 100s–1000s of unique commands per protocol).
Public command corpora: Top-k Linux/Windows commands, manpage summaries, exploit tool documentation, and scripts (Otal et al., 2024, Adebimpe et al., 24 Oct 2025, Jiménez-Román et al., 20 Sep 2025).
Synthetic augmentation: Manually created command permutations; TLDR-style summaries.
Protocol capture and synthesis for network/PLC protocols (LLMPot: PCAP-based, LDIF-based trees).

Preprocessing employs unicode normalization, deduplication, punctuation/whitespace handling, case-folding, and serialization to JSON or CSVL formats suitable for LLM SFT flows.

Fine-tuning leverages frameworks such as Llama-Factory, Unsloth, FLASH attention for context window efficiency, and LoRA adapters for parameter/memory reduction. Hyperparameters frequently reported include learning rate (α=5×10^-4), batch size (~16 sequences), AdamW optimizer, noisy embedding regularization, and quantization to 8-bit adapters for resource-efficient training (Otal et al., 2024, Jiménez-Román et al., 20 Sep 2025).

3. Interactive Response Generation and Prompt Engineering

High-fidelity deception is achieved through careful prompt engineering strategies. Prompts are designed to strictly emulate the persona of the targeted system and constrain the model to output only what the authentic system would produce, with explicit instructions such as:

"Reply only with the terminal outputs inside one unique code block and nothing else."
"Never explain or reveal you are not a real (Linux shell/MySQL client/POP3 server)."
End responses with expected shell or protocol tokens ($, mysql>, +OK) (Sladić et al., 8 Oct 2025, Otal et al., 2024, Adebimpe et al., 24 Oct 2025).
For RAG approaches, system context is augmented with relevant documentation, sample outputs, or behavioral traces retrieved from a vector database.

Certain systems employ chain-of-thought cues or structured prompting (e.g., HoneyGPT's JSON-constrained output fields for "system state changes" and "impact score"), and use per-session history pruning based on either recency or semantic importance ("weaken factor" pruning) to manage LLM context-window exhaustion (Wang et al., 2024).

Mitigation against prompt injection and LLM fingerprinting is operationalized via wrapper-layer command sanitization, immutable system prompts, and hybrid architectures offloading trivial/known commands to deterministic responders (Adebimpe et al., 24 Oct 2025, Malhotra, 1 Sep 2025).

4. Evaluation Metrics and Empirical Results

Evaluation of LLM-based honeypots spans three complementary methodologies: automated statistical measures, controlled human deception studies, and in-the-wild deployments.

Automated Fidelity Measures:

Cosine Similarity: Between dense embeddings of generated vs. ground-truth outputs.
Jaro-Winkler & Levenshtein Similarity/Distance: Character-level comparison metrics, usually normalized for interpretability.
BLEU, SBERT, BertScore: n-gram overlap and transformer-based precision-recall for natural language/protocol responses.
Protocol-Specific Scores: For LDAP, weighted validity combining syntax, operation structure, field-level correctness, and completeness (Jiménez-Román et al., 20 Sep 2025).

Typical results after fine-tuning show notable improvement:

Llama3-8B post-SFT: sim_cos=0.695, JW=0.599, dist_lev=0.285 (Otal et al., 2024).
SBASH: Non-RAG Levenshtein accuracy ~2–3%, RAG-augmented ~10–20%, with SBERT/BertScore tracking higher overall correspondence (Adebimpe et al., 24 Oct 2025).
LLMPot: Response validity >95% for Modbus/S7Comm after 1.6K tuning samples (Vasilatos et al., 2024).

Human Deception Experiments:

VelLMes field: 30.2% of attackers misclassified the LLM honeypot as real; false suspicion rate against real OS was 65.2% (Sladić et al., 8 Oct 2025).
ShelLM (2023): TNR up to 0.91; attackers noted the indistinguishability of common command outputs, with main discrepancies arising from context losses in long sessions (Sladić et al., 2023).
HoneyGPT achieved attack-success and temptation rates of 98% and 96% respectively, exceeding those observed for Cowrie (Wang et al., 2024).

Operational & Security Metrics:

Latency: Ranges from 1.5–3.5 seconds per LLM invocation depending on model; dictionary/cache paths offer <1 ms (Malhotra, 1 Sep 2025, Adebimpe et al., 24 Oct 2025).
Cost: Continuous 24/7 LLM operation at ~\$40/month on modern GPUs (SBASH).
Engagement: Attack sessions exhibit longer duration and higher command-per-session metrics on LLM-based honeypots vs. traditional ones (Wang et al., 2024, Sladić et al., 8 Oct 2025).

5. Protocol and Service Coverage

While early LLM honeypots focused nearly exclusively on UNIX/Linux shells over SSH, the state of the art now extends to:

Database Protocols: MySQL CLI (VelLMes), LDAP (fine-tuned LLaMA-8B in (Jiménez-Román et al., 20 Sep 2025)).
Email & HTTP: POP3 and HTTP with prompt-driven behavioral constraints.
ICS/PLC Networks: LLMPot fine-tunes ByT5 on protocol wire-formats (Modbus, S7Comm, DNP3), achieving byte-accurate protocol emulation and dynamic process simulation.
Honeytoken Generation: LLMs produce files, honeywords, config objects and more, with modular prompt combinations for optimal indistinguishability and diversity (Reti et al., 2024).
AI-Agent Detection: LLM Agent Honeypot employs prompt-injection and RTT analysis to identify hostile LLM-powered hackers among millions of connections (Reworr et al., 2024).

Newer frameworks (VelLMes, LLMPot) feature multi-persona, multi-protocol orchestration, facilitating rapid addition of new services by configuration and prompt insertion rather than traditional code refactoring (Sladić et al., 8 Oct 2025, Vasilatos et al., 2024).

6. Security, Limitations, and Open Research Directions

LLM-driven deception raises both defense and detection sophistication. On the defense side, the adaptiveness and protocol coverage of LLMs block traditional fingerprinting and facilitate engagement with previously unseen attack vectors, as evidenced by the diversity of ATT&CK techniques collected in field deployments (Wang et al., 2024, Reworr et al., 2024). However, salient limitations and open challenges remain:

Latency & Resource Consumption: LLM inference is substantially slower than real systems and exposes the risk of DoS under sustained novel command streams. Mitigations include hybrid caches for common commands and speculative/batched inference (Adebimpe et al., 24 Oct 2025, Malhotra, 1 Sep 2025).
Session State Realism: Context-window exhaustion and statelessness can introduce inconsistencies; persistent per-IP/session history files or checkpointing to external DBs partially addresses this (Sladić et al., 8 Oct 2025). Research on continuous memory (retrieval, vector-DB context) is active.
Prompt Injection & Adversarial Probing: Advanced attackers or LLM agents may craft inputs targeting prompt weaknesses or dynamic variance in outputs. System prompts must be locked, and explicit input sanitization is vital (Adebimpe et al., 24 Oct 2025, Reworr et al., 2024).
Protocol/OS Extensibility: Each new service/protocol often requires protocol-specific wrappers and, for optimal fidelity, augmenting real/synthetic training data with edge cases and updated exploitation payloads (Vasilatos et al., 2024, Jiménez-Román et al., 20 Sep 2025).
Detection Rates and Evaluation: Deception success against sophisticated adversaries remains uncertain; ongoing research explores closed adversarial ecosystems with autonomous LLM-red-teaming to iteratively refine both attacker and defender logic (Bridges et al., 29 Oct 2025).

Future research priorities include reinforcement learning-based optimization of prompt/corpus choices, modularization for multi-service and federated deployments, and privacy-preserving data-sharing schemes for global threat-intelligence enhancement (Bridges et al., 29 Oct 2025).

7. Impact on Threat Intelligence and Deception Engineering

LLM-based honeypots yield significant advances in adversary engagement, session richness, and threat intelligence collection. Log analysis pipelines have evolved from raw cluster-and-anomaly mining towards semantic maps of TTPs (e.g., via BERT or GPT-based MITRE ATT&CK mappings) (Bridges et al., 29 Oct 2025). RAG-based detection and real-time automated TTP labeling enable integration with SOAR and SIEM platforms, facilitating rapid, actionable forensics.

Comprehensive field and controlled studies repeatedly show that LLM-driven honeypots keep attackers interacting longer and capture a broader range of behavioral traces than static or script-only approaches. These systems are rapidly being adopted both in research and enterprise threat intelligence settings as scalable deception platforms, with open-source blueprints enabling wider community iteration (Otal et al., 2024, Sladić et al., 8 Oct 2025).

Key references: