Illicit Darkweb Classification via Natural-language Processing: Classifying Illicit Content of Webpages based on Textual Information (2312.04944v1)
Abstract: This work extends previous research on the classification of illegal activities in three steps. First, we created a heterogeneous dataset of 113995 onion sites and dark marketplaces. Then, we compared pre-trained transferable models, i.e., ULMFiT (Universal Language Model Fine-tuning), BERT (Bidirectional Encoder Representations from Transformers), and RoBERTa (Robustly optimized BERT approach), with a traditional text classification approach, LSTM (Long Short-Term Memory) neural networks. Finally, we developed two approaches for classifying illegal activities, one for illicit content on the Dark Web and one for identifying specific types of drugs. Results show that BERT performed best, classifying the Dark Web's general content and the types of drugs with 96.08% and 91.98% accuracy, respectively.