
A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems (2404.11467v1)

Published 17 Apr 2024 in cs.SE and cs.CR

Abstract: Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of various freely available packages. The fine-grained details within software packages can unveil potential risks within existing OSS ecosystems, offering valuable insights for detecting malicious packages. In this study, we undertake a large-scale empirical analysis focusing on fine-grained information (FGI): the metadata, static, and dynamic functions. Specifically, we investigate the FGI usage across a diverse set of 50,000+ legitimate and 1,000+ malicious packages. Based on this diverse data collection, we conducted a comparative analysis between legitimate and malicious packages. Our findings reveal that (1) malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones; (2) malicious packages demonstrate a higher tendency to invoke HTTP/URL functions as opposed to other application services, such as FTP or SMTP; (3) FGI serves as a distinguishable indicator between legitimate and malicious packages; and (4) one dimension in FGI has sufficient distinguishable capability to detect malicious packages, and combining all dimensions in FGI cannot significantly improve overall performance.
