LLMHoney: LLM-driven SSH Honeypot
- LLMHoney is a real-time SSH honeypot that integrates LLM-driven dynamic response generation to enhance threat intelligence collection.
- It employs a hybrid approach combining a latency-optimized command cache with on-demand LLM inference for stateful, believable session interactions.
- Empirical evaluations show that while high-fidelity LLMs improve realism, they introduce higher computational overhead compared to traditional static honeypots.
LLMHoney is a real-time SSH honeypot platform that leverages LLMs to dynamically generate believable, stateful command outputs for attacker engagement and threat intelligence collection. Unlike traditional low- or medium-interaction honeypots that rely on fixed responses and static virtual filesystems, LLMHoney combines a latency-optimized dictionary-based backend for common commands with on-demand LLM inference for novel or uncommon inputs, thereby enhancing realism and adaptability. The system architecture, model evaluation metrics, and empirical findings underscore the promise and challenges of LLM-driven honeypots as an emerging class of high-interaction deception technologies (Malhotra, 1 Sep 2025).
1. System Design and Architecture
LLMHoney comprises five principal modules—Configuration Loader, Network Listener, Authentication Manager, Session Handler with LLM Engine, and Logging & Storage. The system employs the AsyncSSH library for SSH transport and session management, while credentials, keys, and model selections are loaded from configuration. Attacker commands are processed by the Session Handler, which first checks for a cached output in a Python dictionary that models a virtual filesystem. If the command is not found, an LLM prompt is constructed—incorporating session state such as current directory and user-generated files—and routed to the selected LLM backend via LangChain (supporting Ollama and Google LLM APIs) for dynamic response generation. Resulting outputs are streamed back to the attacker and any file system side effects are updated in the cache. All interaction data—including command, response, latency, and memory usage—is logged into rotating logs and an SQLite session database.
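A plausible minimal sketch of this flow in Python, assuming the AsyncSSH process-factory API and reducing the Configuration Loader, LLM Engine, and Logging & Storage modules to stubs (hostnames, banners, and function names are illustrative, not taken from the paper):

```python
# Illustrative sketch only -- not the authors' code. Assumes the asyncssh package;
# persistence and the LLM engine are stubbed out.
import asyncio
import asyncssh

class HoneypotServer(asyncssh.SSHServer):
    """Authentication Manager stub: accept any credentials and record them."""
    def password_auth_supported(self) -> bool:
        return True

    def validate_password(self, username: str, password: str) -> bool:
        print(f"[auth] login attempt {username}:{password}")  # would go to rotating logs / SQLite
        return True  # honeypot behavior: every credential pair succeeds

async def handle_session(process: asyncssh.SSHServerProcess) -> None:
    """Session Handler stub: read commands, route them, stream responses back."""
    state = {"cwd": "/root", "files": {}}           # per-session context injected into LLM prompts
    process.stdout.write("Welcome to Ubuntu 22.04.3 LTS\r\n")
    while not process.stdin.at_eof():
        process.stdout.write(f"root@web01:{state['cwd']}# ")
        command = (await process.stdin.readline()).strip()
        if command in ("exit", "logout"):
            break
        reply = await generate_response(command, state)  # cache-or-LLM routing, sketched in Section 2
        process.stdout.write(reply + "\r\n")
    process.exit(0)

async def generate_response(command: str, state: dict) -> str:
    return ""  # placeholder; see the routing sketch in Section 2

async def main() -> None:
    await asyncssh.create_server(
        HoneypotServer, "0.0.0.0", 2222,
        server_host_keys=[asyncssh.generate_private_key("ssh-ed25519")],
        process_factory=handle_session,
    )
    await asyncio.Event().wait()  # serve until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```

Accepting every credential pair is deliberate: the honeypot's value lies in capturing the full post-login session rather than gating access.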
2. Command Routing and Virtual Filesystem
Command handling in LLMHoney employs an explicit dichotomy: latency-critical and common commands (e.g., ls, pwd, whoami, cat /etc/passwd) are served directly from a local dictionary cache, delivering instantaneous responses and emulating standard shell behavior. All other commands, including those with unseen flags, rare utilities, or attacker-induced side effects, are processed by the LLM. The virtual filesystem is thus implemented as a Python dictionary keyed by complete command strings with corresponding output values. Updates to session state (such as file creation, deletion, or environment mutations) are tracked and injected into subsequent LLM prompts to preserve context and the illusion of a persistent environment. Additional filters on LLM outputs, such as post-hoc sanitization or "command not found" fallbacks, are planned for future work to mitigate hallucinated responses.
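A hedged sketch of this cache-or-LLM routing, assuming a LangChain chat-model wrapper (the langchain-ollama ChatOllama used here is one possible backend; the cache entries, prompt wording, and side-effect handling are placeholders, not the paper's exact logic):

```python
# Illustrative routing sketch (not the authors' code). Assumes the langchain-ollama
# package; a Gemini backend would go through its own LangChain wrapper instead.
from langchain_ollama import ChatOllama

# Latency-critical commands served straight from the dictionary-based virtual filesystem.
COMMAND_CACHE = {
    "pwd": "/root",
    "whoami": "root",
    "ls": "Desktop  Documents  Downloads",
    "cat /etc/passwd": "root:x:0:0:root:/root:/bin/bash",
}

llm = ChatOllama(model="gemma3:1b")  # any locally served model; name is an example

async def generate_response(command: str, state: dict) -> str:
    # 1. Fast path: exact-match lookup keyed by the complete command string.
    if command in COMMAND_CACHE:
        return COMMAND_CACHE[command]

    # 2. Slow path: build a prompt carrying session state (cwd, attacker-created files)
    #    so the model stays consistent across the session.
    prompt = (
        "You are a Linux server shell. Reply with raw terminal output only.\n"
        f"Current directory: {state['cwd']}\n"
        f"Files created this session: {list(state['files'])}\n"
        f"Command: {command}"
    )
    reply = (await llm.ainvoke(prompt)).content.strip()

    # 3. Track side effects so later prompts stay coherent (naive 'touch' example),
    #    and record the output so repeated commands hit the fast path next time.
    if command.startswith("touch "):
        state["files"][command.split(maxsplit=1)[1]] = ""
    COMMAND_CACHE[command] = reply
    return reply
```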
3. LLM Backend Integration and Evaluation
LLMHoney systematically benchmarks thirteen LLM backends ranging from lightweight open-source models (SmolLM2-360M, Gemma3-1B, Qwen2.5, Phi3-3.8B) to the proprietary Gemini-2.0. Model selection is governed by trade-offs between output fidelity, hallucination rates, latency, and memory usage.
Performance Metrics:
- Latency and Memory Usage: Small models (≤1B parameters) yield sub-second mean latencies (e.g., Gemma3-1B at 488 ms, 0.4 MB/call), but underperform on accuracy.
- String-Level and Semantic Fidelity: Measured via exact-match rate, BLEU-4 score, cosine similarity on TF-IDF embeddings, Jaro–Winkler similarity, and normalized Levenshtein distance. For Gemini-2.0, BLEU=0.245, Cosine=0.405, Jaro–Winkler=0.711, hallucination rate=12.9% (summarized in the table below).
| Model | Latency (ms) | BLEU | Hallucination % | Cosine | Jaro–Winkler |
|---|---|---|---|---|---|
| Gemma3-1B | 488 | 0.027 | 25.9 | 0.170 | 0.517 |
| Qwen 2.5 | 1979 | 0.085 | 20.9 | 0.235 | 0.578 |
| Gemini-2.0 | 3172 | 0.245 | 12.9 | 0.405 | 0.711 |
| Phi3-3.8B | 3596 | 0.061 | 5.8 | 0.209 | 0.611 |
Key Observations:
- Proprietary Gemini-2.0 achieves the highest fidelity and context-awareness but incurs a 3.2 s mean response time.
- Open-source models, particularly Qwen 2.5 and Phi3-3.8B, offer an accuracy–performance sweet spot for live deployment (latency in the 2–4 s range, cosine similarity >0.2).
- Small models are unsuitable for realism due to frequent "out-of-character" or incorrect outputs.
4. Evaluation Metrics and Realism Quantification
LLMHoney employs multiple quantitative metrics for assessing output realism:
- Exact-Match Rate: Proportion of LLMHoney outputs matching ground-truth terminal output.
- BLEU-4 Score: N-gram overlap to evaluate response fluency.
- Cosine Similarity: Semantic alignment of TF-IDF vectorized outputs.
- Jaro–Winkler and Levenshtein: Character-level distance/similarity.
- Hallucination Rate: Proportion of responses containing fabricated or implausible outputs.
Overall success criteria define a generation as "successful" if cosine similarity >0.4 or Jaro–Winkler >0.4. Experimental evaluation on a 138-command subset confirms the correlation between model capacity and fidelity, with latency as the limiting factor for deployment on commodity hardware.
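A sketch of how these metrics could be computed for a single command, assuming scikit-learn, RapidFuzz, and NLTK as stand-ins for whatever tooling the authors used; only the success thresholds (cosine >0.4 or Jaro–Winkler >0.4) come from the paper:

```python
# Hedged metric sketch -- not the authors' evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rapidfuzz.distance import JaroWinkler, Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_output(generated: str, reference: str) -> dict:
    """Compare one honeypot response against the ground-truth terminal output."""
    tfidf = TfidfVectorizer().fit_transform([generated, reference])
    cosine = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
    jw = JaroWinkler.similarity(generated, reference)
    return {
        "exact_match": generated.strip() == reference.strip(),
        "bleu4": sentence_bleu([reference.split()], generated.split(),
                               smoothing_function=SmoothingFunction().method1),
        "cosine_tfidf": cosine,
        "jaro_winkler": jw,
        "levenshtein_norm": Levenshtein.normalized_distance(generated, reference),
        "successful": cosine > 0.4 or jw > 0.4,  # paper's success criterion
    }

# Example: a perfect reproduction scores 1.0 on every similarity metric.
print(score_output("root:x:0:0:root:/root:/bin/bash",
                   "root:x:0:0:root:/root:/bin/bash"))
```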
5. Comparison with Static and Traditional Honeypots
LLMHoney improves on traditional static-response honeypots (e.g., Cowrie) in several dimensions:
- Coverage: LLM-driven generation enables plausible outputs even for previously unseen commands or workflows, whereas static honeypots fail outside a curated set.
- Adaptivity: Session state management ensures consistency across file and environment modifications.
- Anti-Fingerprinting: The combinatorial output space and dynamic session context complicate detection by skilled attackers, unlike easily scriptable static honeypots.
However, LLMHoney imposes higher computational overhead (5–20× per interaction) and is constrained by LLM inference time and resource usage—particularly with larger models requiring 2–12 MB per call.
6. Limitations and Prospective Directions
Major constraints in LLMHoney's architecture include nontrivial hallucination rates—manifesting as invented files, commands, or system characteristics—especially with insufficiently regularized or small models. Context window limitations can disrupt coherence during long or complex attack workflows. Resource utilization remains an obstacle for edge or high-concurrency deployment scenarios. Proposed and ongoing research avenues include:
- Model Compression and Quantization: For lower latency and memory footprint.
- Hybrid Caching: Fine-tuning smaller models on attacker logs and precomputing frequent LLM outputs.
- Advanced Hallucination Detection: Incorporation of discriminator LLMs or context-aware output filters (a rule-based illustration of such a filter follows this list).
- Extended Context Management: Broader tracking of session artifacts and longer context windows for improved realism.
- Adversarial Testing: Real-world deployment to test resilience against prompt injection, timing analysis, and advanced attacker adaptation.
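As a purely hypothetical illustration of the output-filtering direction above, a rule-based post-processor might look like the following; the patterns, fallbacks, and function name are assumptions, not part of LLMHoney:

```python
# Hypothetical sketch of a context-aware output filter -- not implemented in LLMHoney.
import re

LLM_TELLS = (
    re.compile(r"as an ai (language )?model", re.I),  # assistant-style disclaimers
    re.compile(r"```"),                               # markdown fences never appear in a shell
    re.compile(r"i (cannot|can't) (help|assist)", re.I),
)

def filter_output(command: str, reply: str, known_files: set[str]) -> str:
    """Reject replies that break the shell illusion; fall back to a plain shell error."""
    if not command.strip():
        return reply
    fallback = f"bash: {command.split()[0]}: command not found"
    if any(p.search(reply) for p in LLM_TELLS):
        return fallback
    # Context check: reading a file the session never created or cached is more
    # plausibly an error than invented content.
    parts = command.split()
    if parts[0] == "cat" and len(parts) > 1 and parts[1] not in known_files:
        return f"cat: {parts[1]}: No such file or directory"
    return reply
```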
7. Security Implications and Research Impact
LLMHoney demonstrates that integrating LLMs into honeypot systems can substantially increase attack-surface coverage, stateful interactivity, and the analytical value of captured attacker transcripts for downstream threat analysis. A plausible implication is that, as open-source LLMs improve in efficiency and scalability, dynamic honeypot architectures may supplant static approaches in environments requiring deception, adversarial engagement, and rapid intelligence collection. Persistent challenges, notably robust hallucination management and resource efficiency, will be critical for operational deployments and broad adoption (Malhotra, 1 Sep 2025).