Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting (2402.01240v3)
Abstract: HTTP underpins the connectivity of the World Wide Web, and the header fields of HTTP messages carry information that is valuable to web security and privacy research, particularly for the study of web tracking. Existing work identifies web trackers mainly from HTTP request messages, while HTTP response headers are often overlooked. This study designs machine learning classifiers for web tracker detection that use binarized HTTP response headers. The dataset consists of traffic from the Chrome, Firefox, and Brave browsers, collected with the traffic-monitoring browser extension T.EX. Ten supervised models were trained on Chrome data and tested across all browsers, including a Chrome dataset collected a year later. The classifiers achieved high accuracy, F1-score, precision, and recall with low log-loss on Chrome and Firefox, but performed poorly on Brave, potentially because of its distinct data distribution and feature set. These results suggest that the classifiers are viable for web tracker detection. However, testing in real-world applications remains pending, and future studies could distinguish between tracker types and draw on broader label sources.
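To illustrate what "binarized HTTP response headers" could look like as classifier input, here is a minimal sketch. It assumes that binarization means encoding the presence or absence of each header name; the abstract does not specify the exact encoding. The function names `build_vocabulary` and `binarize` and the sample responses are hypothetical, and header names are lower-cased because HTTP field names are case-insensitive.

```python
from typing import Dict, Iterable, List


def build_vocabulary(responses: Iterable[Dict[str, str]]) -> List[str]:
    """Collect the sorted set of lower-cased header names seen in training data."""
    names = set()
    for headers in responses:
        names.update(name.lower() for name in headers)
    return sorted(names)


def binarize(headers: Dict[str, str], vocabulary: List[str]) -> List[int]:
    """Return 1 for each vocabulary header present in the response, else 0."""
    present = {name.lower() for name in headers}
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical training responses (for example, captured in one browser).
train = [
    {"Content-Type": "image/gif", "Set-Cookie": "id=1", "P3P": "CP=NOI"},
    {"Content-Type": "text/html", "Cache-Control": "max-age=3600"},
]
vocab = build_vocabulary(train)

# Hypothetical response from another browser with a header unseen in training.
other = {"content-type": "image/gif", "set-cookie": "id=2", "X-Unseen": "1"}
vector = binarize(other, vocab)

print(vocab)
print(vector)
```

In this sketch, a header that never appeared in the training data (`X-Unseen`) has no column in the vocabulary and is silently dropped. A mismatch of this kind between browsers is one way a differing feature set could affect a model trained on a single browser's traffic.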
- Wolf Rieder
- Philip Raschke
- Thomas Cory