Malicious Package Detection using Metadata Information

Published 12 Feb 2024 in cs.CR | (2402.07444v1)

Abstract: Protecting software supply chains from malicious packages is paramount in the evolving landscape of software development. Attacks on the software supply chain involve attackers injecting harmful software into commonly used packages or libraries in a software repository. For instance, JavaScript uses Node Package Manager (NPM), and Python uses Python Package Index (PyPi) as their respective package repositories. In the past, NPM has had vulnerabilities such as the event-stream incident, where a malicious package was introduced into a popular NPM package, potentially impacting a wide range of projects. As the integration of third-party packages becomes increasingly ubiquitous in modern software development, accelerating the creation and deployment of applications, the need for a robust detection mechanism has become critical. On the other hand, due to the sheer volume of new packages being released daily, the task of identifying malicious packages presents a significant challenge. To address this issue, in this paper, we introduce a metadata-based malicious package detection model, MeMPtec. This model extracts a set of features from package metadata information. These extracted features are classified as either easy-to-manipulate (ETM) or difficult-to-manipulate (DTM) features based on monotonicity and restricted control properties. By utilising these metadata features, not only do we improve the effectiveness of detecting malicious packages, but also we demonstrate its resistance to adversarial attacks in comparison with existing state-of-the-art. Our experiments indicate a significant reduction in both false positives (up to 97.56%) and false negatives (up to 91.86%).


Summary

  • The paper presents MeMPtec, a metadata-driven model that improves detection of malicious packages in software repositories by leveraging ETM and DTM features.
  • It systematically categorizes metadata into six groups and applies machine learning to reduce false positives by up to 97.56% and false negatives by up to 91.86%.
  • Experimental evaluation confirms the model's resilience against adversarial manipulation, making it a promising tool for securing software supply chains.

Metadata-based Malicious Package Detection: MeMPtec

This essay examines the key contributions and findings of the paper "Malicious Package Detection using Metadata Information" (2402.07444), which introduces MeMPtec, a metadata-based model for detecting malicious packages in software repositories.

Introduction and Motivation

The proliferation of Free and Open-Source Software (FOSS) has made software supply chains increasingly vulnerable to attacks, as demonstrated by incidents in repositories like NPM and PyPI. Because software packages are widely reused as dependencies, a single compromised package can affect many downstream projects, so protecting these repositories is paramount. Methods that rely solely on code analysis struggle to keep up with the volume of new packages released daily. The proposed MeMPtec model instead uses package metadata to improve detection accuracy while remaining resilient to adversarial manipulation.

Metadata-based Detection Model: MeMPtec

Model Architecture

At its core, MeMPtec identifies malicious packages from features extracted from package metadata. Each feature is classified as easy-to-manipulate (ETM) or difficult-to-manipulate (DTM), according to its monotonicity and restricted-control properties. The distinction matters because it lets detection remain robust even when an adversary artificially alters some of the metadata.

Figure 1: Proposed Metadata-based Malicious Package Detection (MeMPtec) model architecture.

Feature Categorization

The metadata features are systematically categorized into six groups: Descriptive Information, Stakeholder Information, Dependency Information, Provenance Information, Repository Information, and Context Information. Each category provides a nuanced understanding of potential indicators of malicious activity, from author information to repository links.

Feature Extraction and Classification

The extraction step transforms raw metadata into quantifiable features. ETM features include characteristics such as the presence of special characters in the package name, which the publisher controls directly. DTM features capture package interaction and temporal properties, such as the number of package stars and timestamps, which are harder for an adversary to change.
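The extraction step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the metadata keys (`name`, `created`, `stars`), the feature names, and the definition of a "special character" are all hypothetical.

```python
import re
from datetime import datetime, timezone


def extract_features(meta: dict, now: datetime) -> dict:
    """Turn one package's raw metadata into numeric features (illustrative only)."""
    name = meta.get("name", "").lower()
    created = datetime.fromisoformat(meta["created"])
    return {
        # ETM: the publisher picks the name, so this value is easy to manipulate.
        # Assumption: "special" means anything other than a-z, 0-9, or a hyphen.
        "etm_name_has_special_char": int(bool(re.search(r"[^a-z0-9-]", name))),
        # DTM: package age only grows with time (a monotonic property).
        "dtm_age_days": (now - created).days,
        # DTM: stars are granted by other users, not by the publisher.
        "dtm_stars": int(meta.get("stars", 0)),
    }


# Hypothetical metadata record, invented for this example.
example = {
    "name": "some_pkg.v2",
    "created": "2024-01-01T00:00:00+00:00",
    "stars": 3,
}
features = extract_features(example, now=datetime(2024, 1, 31, tzinfo=timezone.utc))
print(features)
```

Prefixing each feature with `etm_` or `dtm_` is only a naming convenience for this sketch; it keeps the manipulation class visible wherever the feature is used downstream.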

MeMPtec applies machine learning algorithms to the extracted features to classify packages as benign or malicious. Drawing on both feature types gives the model a more comprehensive basis for detection than either type alone.
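To make the classification step concrete, the sketch below trains a small logistic regression on numeric feature vectors. Logistic regression is used here only as a simple stand-in; this summary does not say which algorithms MeMPtec uses, and the feature columns and training data are invented for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Estimated probability that feature vector x belongs to a malicious package."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical columns: [name_has_special_char, age (scaled 0-1), stars (scaled 0-1)]
X = [
    [1, 0.00, 0.0], [1, 0.10, 0.0], [0, 0.05, 0.1],  # labelled malicious
    [0, 0.90, 0.8], [1, 0.80, 0.9], [0, 0.70, 0.6],  # labelled benign
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train(X, y)
p_suspicious = predict_proba(weights, bias, [1, 0.0, 0.0])   # new, unstarred, odd name
p_established = predict_proba(weights, bias, [0, 0.9, 0.9])  # old, well-starred
print(f"suspicious: {p_suspicious:.3f}  established: {p_established:.3f}")
```

A practical system would more likely rely on an established library and held-out test data; the hand-written trainer is used here only to keep the example self-contained.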

Experimental Evaluation

Performance Metrics

MeMPtec was evaluated using precision, recall, F1-score, accuracy, and RMSE. Together, these metrics give a fuller picture of the model's efficacy on both balanced and imbalanced datasets than any single metric would.
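For reference, the listed metrics can be computed from true and predicted labels as sketched below, with 1 denoting malicious and 0 benign. The labels are invented for illustration. RMSE is computed here on hard 0/1 predictions, which is an assumption: this summary does not state whether the paper computes it on labels or on predicted probabilities.

```python
import math


def evaluate(y_true, y_pred):
    """Compute standard binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malicious
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious package missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))
    return {
        "fp": fp, "fn": fn, "precision": precision, "recall": recall,
        "f1": f1, "accuracy": accuracy, "rmse": rmse,
    }


# Hypothetical labels: 3 malicious and 5 benign packages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can look high even when most malicious packages are missed, which is one reason to report precision and recall alongside it.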

Results and Analysis

The experiments demonstrated MeMPtec's superior performance over existing feature selection methods, reducing false positives by up to 97.56% and false negatives by up to 91.86%. This is critical in security contexts, where every undetected malicious package (a false negative) poses substantial risk.

Figure 2: False Positive and False Negative numbers comparison on balanced and imbalanced datasets.
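A percentage reduction of this kind is a relative change in error counts against a baseline. The sketch below shows the arithmetic; the counts are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_errors: int, new_errors: int) -> float:
    """Percentage by which an error count dropped relative to a baseline."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Hypothetical counts: a baseline with 50 false positives, a new model with 5.
fp_reduction = relative_reduction(50, 5)
print(f"False positives reduced by {fp_reduction:.2f}%")
```

Because the measure is relative, it says nothing about absolute counts: a large percentage reduction can correspond to a handful of packages if the baseline made few errors to begin with.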

Additionally, the robustness analysis indicated that MeMPtec is resilient to adversarial attacks. The model maintained high accuracy even when metadata was manipulated, which matters in real-world deployments where attackers may try to deceive detection systems.
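The intuition behind this robustness can be shown with a toy experiment: an adversary rewrites every ETM feature to look benign, and detection is compared before and after. Everything below is a hypothetical sketch. The two threshold rules, the feature names, and the samples are invented and stand in for a trained model; they are not the paper's attack model or classifier.

```python
def manipulate_etm(sample: dict) -> dict:
    """Adversary overwrites every ETM feature with a benign-looking value (0)."""
    return {k: (0 if k.startswith("etm_") else v) for k, v in sample.items()}


def etm_rule(sample: dict) -> bool:
    """Toy detector relying only on an easy-to-manipulate feature."""
    return sample["etm_name_has_special_char"] == 1


def dtm_rule(sample: dict) -> bool:
    """Toy detector relying only on difficult-to-manipulate features."""
    return sample["dtm_age_days"] < 7 and sample["dtm_stars"] == 0


def detection_rate(rule, samples) -> float:
    """Fraction of (all malicious) samples that the rule flags."""
    return sum(1 for s in samples if rule(s)) / len(samples)


# Hypothetical malicious packages: odd names, very new, no stars.
malicious = [
    {"etm_name_has_special_char": 1, "dtm_age_days": 1, "dtm_stars": 0},
    {"etm_name_has_special_char": 1, "dtm_age_days": 3, "dtm_stars": 0},
    {"etm_name_has_special_char": 1, "dtm_age_days": 2, "dtm_stars": 0},
]
forged = [manipulate_etm(s) for s in malicious]

for label, rule in (("ETM rule", etm_rule), ("DTM rule", dtm_rule)):
    before = detection_rate(rule, malicious)
    after = detection_rate(rule, forged)
    print(f"{label}: detection before={before:.2f}, after manipulation={after:.2f}")
```

In this toy setting, the detector that depends on an ETM feature is fully evaded, while the one that depends on DTM features is unaffected, because the adversary cannot cheaply change a package's age or its stars.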

Implications and Future Work

The introduction of MeMPtec has significant implications for software security. By utilizing metadata, this approach circumvents some limitations of code-based analyses, offering a scalable and resilient solution for package repositories. Future research can explore expanding this model to other languages and repository types, as well as improving its adaptive learning capabilities to preemptively adjust to new attack vectors.

Conclusion

MeMPtec represents a significant advancement in detecting malicious packages through metadata analysis. Its experimental validation underscores the potential benefits of integrating metadata insights into broader cybersecurity strategies, and it sets a precedent for future work in metadata-driven security applications. As software ecosystems grow more complex, such models may become integral to securing software supply chains.
