A Big Data Architecture for Early Identification and Categorization of Dark Web Sites (2401.13320v1)
Abstract: The dark web has become notorious for its association with illicit activities, and there is a growing need for systems that automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built on an open-source Big Data stack for data serving (Kubernetes, Kafka, Kubeflow, and MinIO). It continuously discovers onion addresses from several sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloads the HTML from Tor, deduplicates the content using MinHash LSH, and categorizes it with BERTopic topic modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the challenge of Tor volatility. The content is highly repetitive: only 6.1% of the sites are unique. From the HTML files of the dark sites, 31 low-level topics are extracted, manually labeled, and grouped into 11 high-level topics. The five most popular included sexual and violent content, repositories, search engines, carding, cryptocurrencies, and marketplaces. During the experiments, we identified 14 sites with 13,946 clones that shared a suspiciously similar mirroring rate per day, suggesting an extensive common phishing network. Among the related works, this study is the most representative characterization of onion services based on topics to date.
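The deduplication step named in the abstract can be illustrated with a minimal sketch of MinHash signatures combined with LSH banding. This is not the authors' implementation: the signature length, band count, shingle size, hash function, and all function names (`shingles`, `minhash`, `deduplicate`) are assumptions chosen for illustration, and the abstract does not state which parameters or library the system uses.

```python
import hashlib
from collections import defaultdict

# Illustrative parameters (assumptions, not taken from the paper).
NUM_PERM = 64   # number of hash functions, i.e. signature length
BANDS = 16      # LSH bands; each band covers NUM_PERM // BANDS rows
SHINGLE = 3     # words per shingle


def shingles(text, k=SHINGLE):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def deduplicate(pages):
    """Group pages whose signatures collide in at least one LSH band.

    `pages` maps a site identifier to its extracted text. Returns a list
    of sets of identifiers, one set per group of candidate near-duplicates.
    """
    rows = NUM_PERM // BANDS
    parent = {pid: pid for pid in pages}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = {}
    for pid, text in pages.items():
        sig = minhash(text)
        for b in range(BANDS):
            key = (b, sig[b * rows:(b + 1) * rows])
            if key in buckets:
                # Same band content as an earlier page: merge the groups.
                parent[find(pid)] = find(buckets[key])
            else:
                buckets[key] = pid

    groups = defaultdict(set)
    for pid in pages:
        groups[find(pid)].add(pid)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical identifiers and texts, for demonstration only.
    demo = {
        "site-a": "buy cloned credit cards with fast shipping and escrow",
        "site-b": "buy cloned credit cards with fast shipping and escrow",
        "site-c": "search engine index for hidden services updated every day",
    }
    for group in deduplicate(demo):
        print(sorted(group))
```

With more bands and fewer rows per band, pages with lower similarity become candidate duplicates; with fewer bands and more rows, only closer matches collide. A production pipeline would likely verify candidate pairs with an explicit similarity check before treating them as clones.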