
CyberSleuth: Automated Web Attack Forensics

Updated 2 September 2025
  • CyberSleuth is an autonomous, LLM-driven forensic system that automates web attack investigations through multi-agent analysis of packet data and logs.
  • Its modular design deploys specialized sub-agents for tasks like packet parsing and flow summarization, ensuring efficient processing of raw forensic evidence.
  • Evaluated on controlled scenarios, CyberSleuth demonstrates high accuracy in CVE attribution and produces actionable forensic reports for incident response.

CyberSleuth is an autonomous, LLM-based agent system designed for blue-team forensics in web application attacks. It automates forensic investigation by processing packet-level traces and application logs to identify compromised services, exploited vulnerabilities (down to the CVE identifier), and the success or failure of individual attacks. CyberSleuth’s architecture, benchmarking, and evaluation establish it as a high-utility framework for forensic incident response, delivering interpretable, stepwise analysis directly from raw evidence files. The CyberSleuth platform and evaluation suite are publicly available, supporting reproducibility and research progress in LLM-driven blue-team automation (Fumero et al., 28 Aug 2025).

1. Agent System Architecture and Memory Design

CyberSleuth is structured as a multi-agent system, with a main reasoning agent delegating specialized tasks to sub-agents, each equipped with its own tool integrations. The key sub-agents include the "tshark expert," which interfaces with low-level packet analysis tools (such as tshark for detailed PCAP interrogation), and the "Flow Summariser," which ingests connection payloads and synthesizes concise forensic reports from network flows and log data. This separation of duties reduces reasoning burden, enhances modularity, and mimics expert analyst workflows.

A distinguishing architectural element is the adoption of MemGPT-style memory management to support stateful, multi-step investigations without exhausting the LLM context window. The architecture stores memory as a vector database comprising three distinct segments:

  • System Instructions: Static definitions of agent roles and tool APIs.
  • Working Context: Retrieved contextually relevant memory segments for ongoing reasoning.
  • FIFO Queue: A rolling buffer holding the most recent reasoning traces and responses, budgeted by tokens.
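The FIFO segment can be sketched as a token-budgeted rolling buffer that evicts the oldest reasoning traces first. This is a minimal illustration, not the paper's implementation; token counts here use whitespace word counts as a stand-in for a real tokenizer.

```python
from collections import deque

class TokenBudgetedFIFO:
    """Rolling buffer of recent reasoning traces: once the total
    token count exceeds the budget, the oldest entries are evicted.
    Token counting via str.split() is a crude stand-in for the
    model tokenizer."""

    def __init__(self, budget_tokens):
        self.budget = budget_tokens
        self.queue = deque()   # (text, token_count) pairs
        self.used = 0

    def push(self, text):
        tokens = len(text.split())
        self.queue.append((text, tokens))
        self.used += tokens
        # Evict oldest traces until we are back under budget.
        while self.used > self.budget and len(self.queue) > 1:
            _, evicted = self.queue.popleft()
            self.used -= evicted

    def contents(self):
        return [text for text, _ in self.queue]
```

Keeping eviction strictly first-in-first-out matches the "rolling buffer" behavior: the working context, not the FIFO queue, is responsible for retaining older but still-relevant evidence.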

To optimize token allocation, a square-root weighted formula is utilized:

$$\text{allocation}_i = \text{budget} \times \frac{\sqrt{\text{tokens}_i}}{\sum_j \sqrt{\text{tokens}_j}}; \quad \text{final\_allocation}_i = \min(\text{tokens}_i, \text{allocation}_i)$$

This scheduling prevents large logs from dominating context while ensuring crucial forensic evidence remains accessible throughout multi-step analysis.
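The allocation formula above can be implemented in a few lines; this sketch follows the formula term by term (square-root weights, normalization, then a cap at each item's own size):

```python
import math

def sqrt_weighted_allocation(token_counts, budget):
    """Square-root weighted token allocation: larger evidence files
    receive more context, but only sub-linearly, so a single huge
    log cannot crowd out smaller but crucial artifacts. The final
    allocation never exceeds an item's actual token count."""
    weights = [math.sqrt(t) for t in token_counts]
    total = sum(weights)
    raw = [budget * w / total for w in weights]
    return [min(t, a) for t, a in zip(token_counts, raw)]
```

For example, with items of 100 and 400 tokens and a 300-token budget, the square-root weighting splits the budget 1:2 rather than 1:4, giving the smaller item proportionally more room.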

2. Analytical Pipeline and Reasoning Workflow

The CyberSleuth pipeline consists of three coordinated stages: preprocessing (evidence ingestion), stepwise reasoning/reporting, and synthesis of the final forensic report. The system’s operation is outlined as follows:

  1. Ingestion and Parsing: PCAP files and associated logs are parsed using scripting interfaces (e.g., tshark), generating structured summaries of TCP/IP flows and key HTTP/HTTPS requests.
  2. Iterative Reasoning: Specialized sub-agents handle source identification, protocol inference, summary generation, and CVE matching. The reasoning agent issues web searches using live intelligence (Web Search Tool) to retrieve up-to-date CVE and vulnerability data—crucial for precise attribution in current threat landscapes.
  3. Attack Attribution: By correlating header banners (e.g., Apache HTTPD versions), observed exploit payloads, and web search findings, the agent narrows down targets and exploits. The process proceeds until the agent justifies a concrete determination of the targeted service, CVE exploited, and attack success.
  4. Report Generation: Using collated intermediate reasoning and web intelligence, the system produces a detailed forensic report, outlining the investigative path and justifying attribution decisions.
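The ingestion step can be illustrated with a standard tshark invocation. The exact queries CyberSleuth issues are not enumerated here, so the field selection below is an assumption; the flags and field names themselves are standard Wireshark/tshark options.

```python
import subprocess

def tshark_http_summary_cmd(pcap_path):
    """Build a tshark command that extracts key HTTP request fields
    from a capture, one line per request. Field names are standard
    Wireshark display-filter fields; the selection is illustrative."""
    return [
        "tshark", "-r", pcap_path,
        "-Y", "http.request",        # keep only HTTP requests
        "-T", "fields",
        "-e", "ip.src",
        "-e", "http.host",
        "-e", "http.request.method",
        "-e", "http.request.uri",
    ]

def run_summary(pcap_path):
    # Requires tshark on PATH; raises CalledProcessError on failure.
    out = subprocess.run(tshark_http_summary_cmd(pcap_path),
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()
```

Structured per-request lines like these are what the Flow Summariser condenses into concise forensic summaries before the main agent reasons over them.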

A critical architectural decision is simple orchestration: complex inter-agent communication is avoided in favor of main-agent-driven query/response rounds. This reduces coordination deadlocks and encourages interpretability and stability.
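This hub-and-spoke control flow can be sketched as a simple loop; all of the callables below are placeholders for LLM-backed components, since the paper's internal interfaces are not reproduced here:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Query:
    target: str      # which sub-agent to consult
    question: str

def investigate(plan_next: Callable, sub_agents: dict,
                synthesize: Callable, max_steps: int = 5):
    """Main-agent-driven query/response rounds: the main agent's
    policy (plan_next) picks the next specialist to consult, and
    sub-agents never communicate with each other. Returning None
    signals that the main agent is satisfied."""
    findings = []
    for _ in range(max_steps):
        query = plan_next(findings)
        if query is None:
            break
        answer = sub_agents[query.target](query.question)
        findings.append((query.target, answer))
    return synthesize(findings)
```

Because only the main agent issues queries, there is no peer-to-peer messaging to deadlock, and the sequence of (sub-agent, answer) pairs doubles as an interpretable audit trail.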

3. Benchmarking and Comparative Evaluation

CyberSleuth is systematically evaluated using a multi-part benchmark suite:

  • 20 controlled web attack scenarios (attacks on Apache, Nginx, VPN gateways, CMS apps), with increasing complexity and ambiguity, simulate real-world analyst challenges.
  • 10 additional incident traces drawn from the 2025 security landscape test adaptability to novel exploits.

The evaluation investigates four agent designs:

  • Single Agent (SA): a unified reasoning agent that performs all tasks in sequence;
  • Tshark Expert Agent (TEA): delegates packet parsing to a specialist sub-agent;
  • Flow Reporter Agent (FRA): isolates flow summarization in a sub-agent, decoupling it from main reasoning;
  • Log-Integrated Variant: additionally incorporates application/server logs.

Metrics include targeted-service identification accuracy (SA: 66%, FRA: 90%), CVE attribution rates (FRA: up to 80% on recent cases), and attack-success labeling (measured by accuracy, F1 score, and MCC; FRA reaches up to 0.45 MCC). Cost and token efficiency, measured by step count per incident, improve in the modular (FRA) architecture, which typically converges in five steps.
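Attack-success labeling is a binary classification task, so the reported metrics can be computed directly from the confusion counts; a minimal sketch:

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, F1, and Matthews correlation coefficient (MCC)
    for binary attack-success labels (1 = attack succeeded)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, f1, mcc
```

MCC is a sensible headline metric here because attack success/failure labels can be imbalanced across scenarios, and MCC penalizes trivial always-succeed or always-fail predictors that accuracy alone would reward.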

4. Integration with LLM Backends and Open Source Models

CyberSleuth is backend-agnostic, supporting deployment on multiple LLM engines. The paper benchmarks six models, including proprietary (GPT-4o, GPT-5) and open-source (DeepSeek R1) backends. Notably, DeepSeek R1 achieves near-parity with proprietary LLMs, with some human experts expressing a slight subjective preference for its forensic report outputs. This finding supports open, reproducible research in blue-team LLM applications.

Open-sourcing both the CyberSleuth platform and its scenario benchmark enables fair comparisons by external researchers and practitioners, accelerating progress in LLM-driven defense automation.

5. Human Expert Validation

A blinded human study with 22 expert analysts evaluated CyberSleuth’s reports across three critical dimensions: completeness, usefulness, and logical coherence (0–5 scale). Across the scenarios, the system’s outputs consistently scored above 4.3 on average and were judged actionable for incident response. Experts highlighted the clarity of intermediate steps and rationale, aided by explicit logging (e.g., web queries, sub-agent reports).

While a slight preference was noted for DeepSeek R1’s outputs, overall expert ratings confirmed that CyberSleuth’s design produces operationally relevant and trustworthy forensic conclusions.

6. Design Decisions and Guidance for Practitioners

The systematic evaluation underscores several actionable findings for practitioners designing forensic LLM agents:

  • Sub-agent Decoupling: Partitioning packet parsing and flow summarization into independent sub-agents (as in FRA) improves both reasoning focus and context budget utilization.
  • Memory Management: MemGPT-style long-term memory, with dynamic working context and FIFO queue, is critical for scaling analysis to large-volume evidence traces.
  • Web Integration: Live, targeted web search is essential for CVE discovery and vulnerability attribution, compensating for rapid ecosystem evolution not captured in training data.
  • Open-source Advantage: Open LLMs can match proprietary alternatives and reduce operational costs in enterprise blue-team deployments.

A plausible implication is that forensic LLM design should prioritize modularity, isolation of computationally intensive subtasks, and careful state management.

7. Significance and Future Directions

CyberSleuth represents a significant advance in the automation of forensic analysis for blue teams. By bridging packet-level analysis, CVE intelligence retrieval, and interpretable reasoning, it enables rapid, reliable web attack investigation with quantifiable accuracy. Its open-source platform and rigorous benchmarking framework serve as a research catalyst in both academia and incident response practice.

Continuing work focuses on scaling agent architectures to more complex, multi-modal evidence (e.g., multi-tenant log data, memory dumps), integrating adversarial robustness, and refining human–agent collaboration loops.


This article is based exclusively on the systematic design, benchmarking, and evaluation described in "CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics" (Fumero et al., 28 Aug 2025).
