The Impact of Train-Test Leakage on Machine Learning-based Android Malware Detection
Abstract: When machine learning is used for Android malware detection, each app must be represented in a numerical format for training and testing. We identify a widespread occurrence of distinct Android apps that have identical or nearly identical app representations. In particular, a significant percentage of the apps in the testing dataset can have a representation that is identical or nearly identical to that of an app in the training dataset. This causes a data leakage problem that inflates a machine learning model's performance as measured on the testing dataset. The data leakage not only can lead to overly optimistic perceptions of the machine learning models' ability to generalize beyond the data on which they are trained; in some cases it can also lead to qualitatively different conclusions being drawn from the research. We present two case studies to illustrate this impact. In the first case study, the data leakage inflated the performance results but did not qualitatively change the researchers' overall conclusions. In the second case study, the data leakage would have led to qualitatively different conclusions being drawn from the research. We further propose a leak-aware scheme for constructing a machine learning-based Android malware detector, and show that it can improve overall detection performance.
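The leakage described in the abstract can be made concrete with a small check. The following is a minimal sketch, not the paper's method: it assumes apps are represented as fixed-length binary feature vectors, and it treats two representations as "nearly identical" when they differ in at most `max_hamming` positions. The function name `leak_rate`, the Hamming-distance criterion, and the toy data are all illustrative assumptions.

```python
from typing import Sequence


def leak_rate(
    train: Sequence[Sequence[int]],
    test: Sequence[Sequence[int]],
    max_hamming: int = 0,
) -> float:
    """Fraction of test vectors that match some training vector.

    With max_hamming=0 only exact duplicates count. A positive value
    also counts vectors differing in at most that many positions
    (an assumed stand-in for "nearly identical").
    """
    if not test:
        return 0.0
    # Hashable copies of the training vectors for fast exact lookup.
    train_set = {tuple(v) for v in train}
    leaked = 0
    for v in test:
        t = tuple(v)
        if t in train_set:
            leaked += 1
        elif max_hamming > 0 and any(
            sum(a != b for a, b in zip(t, u)) <= max_hamming
            for u in train_set
        ):
            leaked += 1
    return leaked / len(test)


# Toy data (hypothetical): each tuple is one app's binary feature vector.
train = [(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 0, 0)]
test = [(1, 0, 1, 0), (0, 1, 1, 1), (0, 0, 0, 1), (1, 1, 0, 0)]

print(leak_rate(train, test))                 # exact duplicates only
print(leak_rate(train, test, max_hamming=1))  # also near-duplicates
```

A high leak rate suggests that performance measured on the full test set partly reflects memorized training samples. Reporting metrics separately on the leaked and non-leaked portions of the test set is one way to see how much of the headline number depends on them.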