Enabling Privacy-Preserving Cyber Threat Detection with Federated Learning (2404.05130v1)
Abstract: Despite achieving good performance and wide adoption, machine learning based security detection models (e.g., malware classifiers) are subject to concept drift and the evasive evolution of attackers, which makes up-to-date threat data a necessity. However, due to the enforcement of various privacy protection regulations (e.g., GDPR), it is becoming increasingly challenging or even prohibitive for security vendors to collect individual-relevant and privacy-sensitive threat datasets, e.g., SMS spam/non-spam messages from mobile devices. To address such obstacles, this study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection in terms of effectiveness, Byzantine resilience, and efficiency. This is made possible by the construction of multiple threat datasets and threat detection models and, more importantly, the design of realistic and security-specific experiments. We evaluate FL on two representative threat detection tasks, namely SMS spam detection and Android malware detection. The results show that FL-trained detection models can achieve performance comparable to that of centrally trained counterparts. Also, most non-IID data distributions have either minor or negligible impact on model performance, while a label-based non-IID distribution of a high extent can incur non-negligible fluctuation and delay in FL training. Then, under a realistic threat model, FL turns out to be resistant to both data poisoning and model poisoning attacks. In particular, a practical data poisoning attack causes no more than a 0.14% loss in model accuracy. Regarding FL efficiency, a bootstrapping strategy turns out to be effective at mitigating the training delay observed in label-based non-IID scenarios.
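To make "label-based non-IID distribution" concrete, the sketch below shows one common way to simulate label skew across FL clients: for each class, the share of samples given to each client is drawn from a Dirichlet distribution. This is an illustrative assumption, not necessarily the partitioning procedure used in the paper; the function name `dirichlet_label_partition`, the `alpha` parameter, and the toy label counts are all hypothetical.

```python
import random
from collections import defaultdict


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.

    For each class, per-client shares are drawn from Dirichlet(alpha, ..., alpha).
    Smaller alpha gives more skewed (more strongly non-IID) label mixes.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total == 0.0:
            # Degenerate draw (possible for tiny alpha): fall back to equal shares.
            shares = [1.0 / num_clients] * num_clients
        else:
            shares = [g / total for g in gammas]
        # Turn shares into consecutive slices of this class's indices.
        start = 0
        cumulative = 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            end = max(end, start)
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


# Hypothetical toy data: 0 = non-spam, 1 = spam.
labels = [0] * 900 + [1] * 100
parts = dirichlet_label_partition(labels, num_clients=5, alpha=0.1)
for cid, part in enumerate(parts):
    spam = sum(labels[i] for i in part)
    print(f"client {cid}: {len(part)} samples, {spam} spam")
```

With a small `alpha`, most clients end up holding samples from essentially one class, which is the kind of high-extent label skew the abstract associates with training fluctuation and delay; a large `alpha` yields nearly uniform per-client label mixes.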