Fingerprinting web servers through Transformer-encoded HTTP response headers (2404.00056v1)
Abstract: We explored leveraging state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions. Aiming to improve accuracy and specificity over rule-based systems, we sent various ambiguous and non-standard HTTP requests to 4.77 million domains and captured the HTTP response status lines. We represented these status lines by training a BPE tokenizer and a RoBERTa encoder with unsupervised masked language modeling. We then reduced the dimensionality of the encoded response lines and concatenated them to represent each domain's web server. A Random Forest and a multilayer perceptron (MLP) classified these web servers, achieving macro F1-scores of 0.94 and 0.96, respectively, on detecting the five most popular origin web servers. The MLP achieved a weighted F1-score of 0.55 on classifying 347 major type and minor version pairs. Analysis indicates that our test cases are meaningful discriminants of web server types. Our approach demonstrates promise as a powerful and flexible alternative to rule-based systems.
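The abstract reports a macro F1-score for the five-class task and a weighted F1-score for the 347-class task, and the two averages behave differently when classes are imbalanced. The following is a minimal sketch of that difference, not the authors' evaluation code. The labels `server_a`, `server_b`, and `server_c` and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true samples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical, imbalanced toy data: one frequent class, one rare class.
y_true = ["server_a"] * 6 + ["server_b"] * 3 + ["server_c"] * 1
y_pred = ["server_a"] * 6 + ["server_b", "server_b", "server_a"] + ["server_b"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy case the rare class is always misclassified, so the macro average is pulled down further than the weighted average. The example only illustrates why the two reported scores are not directly comparable across tasks.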