Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence

Published 25 Apr 2025 in cs.CR (arXiv:2504.18375v1)

Abstract: Public information contains valuable Cyber Threat Intelligence (CTI) that is used to prevent future attacks. While standards exist for sharing this information, much appears in non-standardized news articles or blogs. Monitoring online sources for threats is time-consuming and source selection is uncertain. Current research focuses on extracting Indicators of Compromise from known sources, rarely addressing new source identification. This paper proposes a CTI-focused crawler using multi-armed bandit (MAB) and various crawling strategies. It employs SBERT to identify relevant documents while dynamically adapting its crawling path. Our system ThreatCrawl achieves a harvest rate exceeding 25% and expands its seed by over 300% while maintaining topical focus. Additionally, the crawler identifies previously unknown but highly relevant overview pages, datasets, and domains.

Summary

  • The paper introduces ThreatCrawl, which dynamically discovers new cyber threat sources using a UCB1-based Multi-Armed Bandit algorithm integrated with crawling, classification, and ranking techniques.
  • It employs SBERT embeddings and cosine similarity to assess page relevance from a priority queue initialized with seed URLs.
  • Experimental results demonstrate that ThreatCrawl achieves harvest rates up to 25.14%, identifies over 270 relevant domains, and expands its seed set by over 300%.

The paper "Bandit on the Hunt: Dynamic Crawling for Cyber Threat Intelligence" (2504.18375) proposes ThreatCrawl, a focused web crawling system designed to automatically discover new, relevant information sources for Cyber Threat Intelligence (CTI). The core problem addressed is the difficulty faced by security analysts in manually identifying and monitoring numerous online sources like blogs and news articles where CTI is often published in unstructured formats. While existing tools focus on extracting Indicators of Compromise (IOCs) from known sources, ThreatCrawl aims to actively find new sources related to a given set of initial CTI pages (seed URLs).

ThreatCrawl integrates several techniques into a single pipeline: crawling, relevance classification, and ranking. Its objective is to identify pages P that are contextually similar (P ∼ S) to a user-provided set of seed pages S. This similarity-based approach aligns with the shifting landscape of CTI relevance, which increasingly includes broader textual information like threat reports and malware analyses beyond just specific IOCs.

The system operates based on a user-provided set of seed URLs, which are initially crawled to serve as ground truth for relevance classification and to prepare the system, including initializing a Multi-Armed Bandit (MAB). A priority queue manages discovered pages, ordered by their relevance. Pages highly similar to the seed set are prioritized.
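The frontier described above can be sketched as a max-priority queue keyed on seed similarity. The following minimal sketch uses Python's `heapq` (a min-heap, so scores are negated); the URLs and scores are illustrative, not from the paper.

```python
import heapq

# Crawl frontier: pages ordered by similarity to the seed set.
frontier = []

def enqueue(url, similarity):
    # heapq pops the smallest tuple first, so negate the score
    # to make the most seed-similar page come out first.
    heapq.heappush(frontier, (-similarity, url))

def next_page():
    neg_similarity, url = heapq.heappop(frontier)
    return url, -neg_similarity

enqueue("https://example.com/threat-report", 0.91)
enqueue("https://example.com/unrelated", 0.30)
enqueue("https://example.com/malware-analysis", 0.77)

url, score = next_page()  # highest-similarity page first
```

In the real system the similarity score would come from SBERT embeddings of the page content, as described below; here it is a hand-picked number.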

ThreatCrawl uses a Multi-Armed Bandit (MAB) algorithm, specifically UCB1, to dynamically select the most promising crawling "arm" or search action at each step. UCB1 is chosen for its balance of exploration and exploitation in relatively stable environments, which is suitable for leveraging domain knowledge from the seed set. The MAB helps the crawler decide which strategy is most likely to yield relevant pages next.
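The standard UCB1 rule picks the arm maximizing the empirical mean reward plus an exploration bonus of sqrt(2 ln t / n_i). A minimal sketch (the function name and the toy reward values are assumptions, not from the paper):

```python
import math

def ucb1_select(counts, rewards):
    """Pick the next arm by UCB1.

    counts[i]  -- how many times arm i has been played
    rewards[i] -- cumulative reward collected from arm i
    """
    # Play any arm that has never been tried.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    t = sum(counts)
    scores = [
        rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)
```

With arms corresponding to the three search actions (forward, backward, keyword), an arm whose mean reward is highest wins once all arms have been tried equally often, while rarely played arms get a growing exploration bonus.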

The system employs three distinct search actions:

  1. Forward Link Search (F): Follows hyperlinks found on the current page. This is effective for exploring deeper content within a website.
  2. Backward Link Search (B): Identifies pages that link to the current page. This helps discover related external sources, often used in SEO analysis. The paper notes reliance on a commercial API for this data.
  3. Keyword Search (K): Extracts key terms from the current page (using tools like KeyBERT) and performs a search using these keywords to find other relevant pages. This action also utilizes a commercial API for the search itself.

Relevance classification is performed using Sentence-BERT (SBERT) embeddings. SBERT generates dense vector representations of page content, capturing semantic meaning more effectively than traditional methods like TF-IDF. The similarity between a crawled page's embedding and the embeddings of the seed pages (or the seed set as a whole) is calculated using cosine similarity. A page p is considered relevant to the seed set S if its maximum cosine similarity to any page in S exceeds a defined relevance threshold θ.

The MAB's reward function for a given step (page) is calculated based on the number of new relevant domains discovered and the sum of similarities of the newly discovered relevant pages to the seed set. This guides the MAB to favor actions that lead to both high-relevance content and broader domain discovery.
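A sketch of that reward signal; the exact weighting of the two terms in the paper may differ, and the function name and data layout are assumptions for illustration:

```python
def step_reward(new_pages, seen_domains):
    """Reward for one crawl step.

    new_pages    -- list of (domain, seed_similarity) for newly found
                    relevant pages at this step
    seen_domains -- set of relevant domains discovered before this step
    """
    new_domains = {domain for domain, _ in new_pages} - seen_domains
    # Count of newly discovered relevant domains, plus the summed
    # seed similarities of the newly found relevant pages.
    return len(new_domains) + sum(sim for _, sim in new_pages)
```

Combining both terms rewards actions that surface highly similar content *and* previously unseen domains, rather than repeatedly mining one known site.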

The evaluation of ThreatCrawl demonstrated its effectiveness. Using a relevance threshold of 0.6 and a seed threshold of 0.8, experiments were conducted for up to 500 and 2000 steps using various combinations of the search actions (FBK, FB, FK, BK, F, B, K).

Key results from the 2000-step evaluation:

  • The combination of Backward Link Search and Keyword Search (TC-BK) achieved the highest harvest rate (relevant pages / total crawled pages) at 25.14%. This indicates a high precision in finding relevant content.
  • TC-BK also identified the highest number of relevant domains (270), suggesting its effectiveness in expanding the system's knowledge base beyond the initial seed domains.
  • Forward Link Search alone (TC-F) crawled the most relevant pages (4992) and achieved a harvest rate of 22.99%.
  • The combination of Forward Link Search and Keyword Search (TC-FK) achieved a harvest rate of 22.55% and identified 4119 relevant pages.
  • The performance significantly exceeded the reported harvest rates (~9.5%) of prior CTI-focused crawling work.
  • ThreatCrawl successfully identified previously unknown security overview pages, datasets, and news domains relevant to the CTI domain, demonstrating its ability to expand the seed set by over 300% while maintaining focus.

From a practical standpoint, ThreatCrawl provides a method for CTI analysts to automate the laborious task of identifying new sources. By starting with a small set of known relevant URLs, the system can intelligently explore the web, prioritizing pages and domains semantically similar to the initial scope. This frees up analyst time for more critical tasks like analysis and response.

However, the paper notes limitations. The reliance on search engine features for backward link and keyword search is vulnerable to API changes or removals. The evaluation was limited in runtime (up to 2000 steps), so long-term behavior and saturation points are not fully understood. The use of pre-trained SBERT, while effective, might be less accurate than larger, more domain-specific models, though larger models introduce higher computational and privacy costs.

Future work could involve dynamically adjusting relevance thresholds based on performance, integrating user feedback, using graph-based analysis to understand source relationships, and potentially incorporating techniques to identify content aggregators or original publishers. The adaptability of the system based on the seed suggests its potential applicability to other domains beyond CTI, which warrants further evaluation.
