An Empirical Study of Malicious Code In PyPI Ecosystem (2309.11021v1)
Abstract: PyPI gives developers a convenient and accessible package management platform, letting them implement specific functionality quickly and work more efficiently. However, the rapid growth of the PyPI ecosystem has brought a severe problem of malicious package propagation. Malicious developers disguise malicious packages as benign ones, posing a significant security risk to end users. To understand the characteristics and current state of the malicious code lifecycle in the PyPI ecosystem, we conducted an empirical study. We first built an automated data collection framework and assembled a multi-source malicious code dataset containing 4,669 malicious package files. We then made a preliminary classification of this malicious code into five categories based on its malicious behaviour. We found that over 50% of the malicious code exhibits multiple malicious behaviours, with information stealing and command execution being particularly prevalent. In addition, we observed several novel attack vectors and anti-detection techniques. Our analysis revealed that 74.81% of all malicious packages successfully entered end-user projects through source code installation, thereby increasing security risks. A real-world investigation showed that many reported malicious packages persist on PyPI mirror servers around the world, with over 72% remaining for an extended period after being discovered. Finally, we sketched a portrait of the malicious code lifecycle in the PyPI ecosystem that reflects the characteristics of malicious code at each stage. We also present suggested mitigations to improve the security of the Python open-source ecosystem.
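The mirror-persistence finding can be made concrete with a small check. The following is a minimal sketch, not the authors' framework: it assumes each mirror exposes a PEP 503 style "simple" index, and it treats an HTTP 200 response for a project page as "still listed". The function names, the `mirror.example.org` URL, and the package name `example-package` are hypothetical and chosen for illustration only.

```python
import re
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_url(mirror_base: str, package: str) -> str:
    """Build the project page URL under a mirror's simple index."""
    return f"{mirror_base.rstrip('/')}/{normalize(package)}/"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if the server is unreachable."""
    try:
        with urlopen(Request(url, method="GET"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
    except URLError:
        return 0


def still_listed(
    package: str,
    mirrors: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each mirror base URL to whether it still serves the project page.

    Assumption: a 200 response means the package is still listed there.
    """
    return {
        base: fetch_status(simple_url(base, package)) == 200
        for base in mirrors
    }


if __name__ == "__main__":
    # Hypothetical inputs; replace with the mirrors and package under study.
    mirrors = ["https://pypi.org/simple/", "https://mirror.example.org/pypi/simple/"]
    print(still_listed("example-package", mirrors))
```

The `fetch_status` parameter is injectable so the logic can be exercised offline. A listed project page does not by itself prove that a malicious release file is still downloadable, so a real study would also need to inspect the files linked from that page.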