Large Language Models are Unreliable for Cyber Threat Intelligence (2503.23175v1)
Abstract: Several recent works have argued that LLMs can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, in addition to testing LLMs on CTI tasks with zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, raising doubts about the feasibility of using LLMs in CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
Summary
- The paper reveals that LLMs perform poorly on real CTI reports, missing over 20% of crucial entities in extraction and generation tasks.
- LLMs exhibit significant inconsistency by providing varying outputs for identical inputs, which undermines trust in automated CTI pipelines.
- The study highlights that LLMs are poorly calibrated, with confidence scores misaligned with true accuracy, posing risks in threat intelligence applications.
This paper, "LLMs are Unreliable for Cyber Threat Intelligence" (2503.23175), investigates the practical reliability of state-of-the-art LLMs for real-world Cyber Threat Intelligence (CTI) tasks. The authors challenge recent optimistic claims about LLMs in CTI by evaluating their performance, consistency, and calibration on a dataset of full-length CTI reports, which are significantly longer and more complex than the text snippets often used in prior research.
The core problem addressed is the potential security risk organizations face if they blindly rely on LLMs for automating CTI processing without proper evaluation. The paper highlights that while LLMs might perform well on short, synthetic examples, real CTI reports contain extensive information, including noise and ambiguities, which can confuse LLMs.
The paper focuses on two practical CTI tasks:
- Information Extraction: Extracting structured entities (like APT names, campaign dates, CVEs, attack vectors) from unstructured CTI reports. This is crucial for tasks like populating CTI databases or generating STIX objects.
- Information Generation: Generating a structured profile for an APT based on its name and description (including goals, labels, country of origin, associated CVEs, and attack vectors). This is relevant for building CTI chatbots or automated threat profile summaries.
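To make the extraction target concrete, the sketch below shows the kind of structured record an LLM could be asked to produce per report; the field names and values are illustrative assumptions, not the authors' exact schema.

```python
import json

# Hypothetical per-report extraction record (illustrative field names and
# placeholder values; not the paper's exact schema). One such record would be
# produced for each CTI report and compared against the labelled ground truth.
example_extraction = {
    "apt_name": "ExampleAPT",
    "campaign_dates": ["2016-01", "2016-05"],
    "cves": ["CVE-2016-0001", "CVE-2016-0002"],   # placeholder identifiers
    "attack_vectors": ["spearphishing attachment", "drive-by compromise"],
}

print(json.dumps(example_extraction, indent=2))
```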
To evaluate the LLMs, the authors propose a five-step pipeline:
- Evaluate performance using zero-shot learning.
- Evaluate performance using few-shot learning.
- Evaluate performance after fine-tuning (Supervised Fine-Tuning - SFT).
- Quantify performance consistency by calculating confidence intervals (CIs) based on repeated prompts with the same input.
- Analyze confidence calibration using Expected Calibration Error (ECE) and Brier Score (BS), derived from token log probabilities.
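The metrics in the last step are standard; as a point of reference, the sketch below computes ECE and the Brier Score from per-prediction confidence scores (e.g., averaged token probabilities) and binary correctness labels, following the usual definitions rather than the authors' code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # bin weight = fraction of samples in bin
    return ece

def brier_score(confidences, correct):
    """BS: mean squared error between confidence and binary outcome."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((confidences - correct) ** 2))

# Toy usage: an overconfident model scores poorly on both metrics.
conf = [0.95, 0.90, 0.85, 0.99, 0.60]
hit  = [1,    0,    0,    1,    1]
print(expected_calibration_error(conf, hit), brier_score(conf, hit))
```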
The experiments were conducted using the APT dataset by Di Tizio et al. [di2022software], comprising 350 real CTI reports (average 3009 words) on MITRE APTs. This dataset is significantly larger than those used in many prior studies. The evaluation used state-of-the-art LLMs: gpt4o (OpenAI), gemini-1.5-pro-latest (Google), and mistral-large-2 (Mistral), selected for their large context windows, fine-tuning capabilities, and JSON output support. Prompt engineering techniques like role specification, step specification, input subdivision, and world closing were applied.
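As an illustration of how these techniques could be combined, the template below sketches a zero-shot extraction prompt with role specification, step specification, and "world closing" (restricting the model to the report text). It is an assumed reconstruction, not the paper's exact prompt.

```python
# Hypothetical zero-shot extraction prompt (assumed reconstruction, not the
# paper's exact wording) combining role specification, step specification,
# and world closing.
PROMPT_TEMPLATE = """You are a cyber threat intelligence analyst.

Follow these steps:
1. Read the report below.
2. Extract the APT name, campaign dates, CVEs, and attack vectors.
3. Return the result as a single JSON object with those four keys.

Use only information contained in the report; if an entity is not
mentioned, return an empty list for it.

Report:
{report_text}
"""

def build_prompt(report_text: str) -> str:
    return PROMPT_TEMPLATE.format(report_text=report_text)
```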
Key Findings and Practical Implications:
- Performance on Real Reports (RQ1): LLMs showed low performance (Precision and Recall) on real-size reports compared to claims based on short texts. For information extraction, recall was often below 80% for critical entities like campaigns and CVEs, meaning over 20% of vulnerabilities or campaign details could be missed (False Negatives). For information generation, recall was particularly low for entities like APT labels (as low as 0.02) and CVEs (as low as 0.06), making LLMs unreliable for building accurate APT profiles.
  - Implementation Impact: Relying solely on LLMs for automated CTI extraction or generation pipelines could lead to significant gaps in intelligence, potentially causing organizations to overlook critical threats, fail to patch vulnerabilities, or misattribute attacks.
- Few-shot & Fine-tuning: Surprisingly, few-shot learning and fine-tuning often decreased performance for several entities in both tasks, suggesting these common techniques might not effectively transfer knowledge or adapt models to the complexity of real CTI reports in these scenarios. This challenges the assumption that standard adaptation methods guarantee improvement in this domain.
- Consistency of Output (RQ2): The LLMs demonstrated a lack of consistency, providing different outputs when prompted multiple times with the identical input, even with temperature=0 and a fixed seed.
  - Implementation Impact: This non-determinism poses significant risks. If an automated system extracts different CVEs or attack vectors from the same report on subsequent analyses, it creates uncertainty for downstream processes like patch management or defensive configuration, making it hard to trust repeated queries for critical intelligence (a sketch of such a repeat-query check appears after this list). The consistency issue was more pronounced in information generation tasks.
- Calibration of Confidence (RQ3): LLMs were found to be poorly calibrated for CTI tasks, meaning their expressed confidence level does not accurately reflect the true probability of their predictions being correct (high ECE and BS values).
  - Implementation Impact: In real-world CTI scenarios where labeled ground truth for evaluation might be scarce, practitioners often rely on model confidence to decide whether to trust a prediction. Poor calibration means a highly confident prediction might be wrong (overconfidence), leading to false positives and incorrect actions, or a low-confidence prediction might be correct (underconfidence), causing correct intelligence to be discarded (false negatives). This is particularly concerning for entities like CVEs and attack vectors, where calibration was especially poor after fine-tuning.
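A minimal sketch of the repeat-query consistency check referenced above: the same report is submitted several times and a bootstrap confidence interval is computed over the per-run recall. `query_llm` is a placeholder for whichever model API is used, and the procedure is a generic re-prompting scheme, not necessarily the authors' exact protocol.

```python
import random

def per_run_recall(extracted: set, ground_truth: set) -> float:
    """Fraction of ground-truth entities (e.g. CVEs) recovered in one run."""
    if not ground_truth:
        return 1.0
    return len(extracted & ground_truth) / len(ground_truth)

def consistency_check(query_llm, report_text, ground_truth,
                      n_runs=20, n_boot=1000, seed=0):
    """Re-prompt the model n_runs times on the same report and return the
    per-run recalls plus a 95% bootstrap confidence interval of their mean.
    `query_llm` is a placeholder returning a set of extracted entities."""
    recalls = [per_run_recall(query_llm(report_text), ground_truth)
               for _ in range(n_runs)]
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(recalls, k=n_runs)) / n_runs for _ in range(n_boot)
    )
    ci = (boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)])
    return recalls, ci

# A wide interval, or outright different entity sets across runs, signals the
# inconsistency the paper reports, even at temperature=0 with a fixed seed.
```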
Implementation Considerations and Limitations:
- Processing real CTI reports requires LLMs with large context windows (like the models tested), which have higher computational requirements and costs.
- The poor performance, inconsistency, and miscalibration observed suggest that current LLMs are not suitable for direct, unsupervised deployment in automated CTI pipelines where high accuracy and reliability are paramount.
- Standard prompt engineering, few-shot learning, and fine-tuning techniques do not appear to be sufficient solutions for these limitations in the CTI domain, possibly due to the inherent complexity and ambiguity of real-world threat intelligence text.
- The paper was limited to one dataset and a small number of LLMs. Evaluating on more diverse datasets and models is necessary to generalize these findings.
- Quantifying consistency via re-prompting is computationally expensive, especially for obtaining precise confidence intervals, presenting a cost-precision trade-off for practical evaluation.
- Calibration analysis was limited by the ability to extract log probabilities from closed-source models (only possible with gpt4o in this paper), hindering a comparative analysis of calibration across all models.
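For models that do expose token log probabilities, a per-answer confidence score of the kind used in the calibration analysis can be derived roughly as below. This sketch assumes the OpenAI Python SDK's logprobs option on the Chat Completions endpoint and aggregates via a geometric mean of token probabilities, which is one plausible choice rather than the paper's documented method.

```python
import math
from openai import OpenAI  # assumes the OpenAI Python SDK v1.x is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_confidence(prompt: str, model: str = "gpt-4o"):
    """Query the model once and derive a rough confidence score as the
    geometric mean of the generated tokens' probabilities (illustrative
    aggregation choice, not necessarily the paper's)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        logprobs=True,          # ask the API for per-token log probabilities
    )
    choice = resp.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return choice.message.content, confidence
```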
In conclusion, the paper provides strong evidence that, despite promising results on simplified tasks, current state-of-the-art LLMs are unreliable for practical CTI work involving real-size reports. Their limitations in performance, consistency, and calibration introduce significant security risks by potentially providing inaccurate, contradictory, or untrustworthy intelligence. For implementing AI in CTI, practitioners should be highly cautious and consider alternative or augmented approaches.
Future research suggested by the authors includes exploring other datasets, evaluating more LLMs, experimenting with advanced techniques like Chain-of-Thought (CoT) prompting or Retrieval Augmented Generation (RAG) which can leverage external knowledge sources, and investigating multi-agent systems involving multiple LLMs. Improving consistency measurement and integrating formal analysis with empirical studies are also highlighted as important future steps.