
Fingerprinting web servers through Transformer-encoded HTTP response headers (2404.00056v1)

Published 26 Mar 2024 in cs.CR, cs.LG, and cs.NI

Abstract: We explored leveraging state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions. To improve accuracy and specificity over rule-based systems, we sent various ambiguous and non-standard HTTP requests to 4.77 million domains and captured the HTTP response status lines. We represented these status lines by training a BPE tokenizer and a RoBERTa encoder with unsupervised masked language modeling. We then reduced the dimensionality of the encoded response lines and concatenated them to represent each domain's web server. A Random Forest and a multilayer perceptron (MLP) classified these web servers, achieving macro F1-scores of 0.94 and 0.96, respectively, on detecting the five most popular origin web servers. The MLP achieved a weighted F1-score of 0.55 on classifying 347 major type and minor version pairs. Analysis indicates that our test cases are meaningful discriminants of web server types. Our approach demonstrates promise as a powerful and flexible alternative to rule-based systems.
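The abstract reports both macro and weighted F1-scores, which can differ substantially when classes are imbalanced. The following is a minimal sketch of how the two averages are computed, using only the Python standard library. The function names and the toy labels are hypothetical illustrations and are not taken from the paper's data or code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's share of y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical toy labels (not from the paper): one common class, two rarer ones.
y_true = ["nginx"] * 6 + ["apache"] * 3 + ["iis"]
y_pred = ["nginx"] * 6 + ["apache"] * 2 + ["nginx"] + ["apache"]

print("macro F1:   ", round(macro_f1(y_true, y_pred), 3))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 3))
```

In this toy case the rare class is never predicted correctly, so it pulls the macro average down far more than the weighted average. This is why the two figures in the abstract are not directly comparable: they cover different label sets and use different averaging.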

