Leveraging Large Language Models to Detect npm Malicious Packages (2403.12196v4)

Published 18 Mar 2024 in cs.CR and cs.AI

Abstract: Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection could be enhanced by adopting advanced, more automated approaches that achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of LLMs in detecting malicious code. We present SocketAI, a malicious code review workflow that uses LLMs to detect malicious code. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 contain malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious JavaScript code. We also compare the effectiveness of static analysis as a pre-screener for the SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious activities detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and a 97% F1 score, while GPT-3 offers a more cost-effective balance at 91% precision and a 94% F1 score. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domains as the top categories of detected malicious packages.
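
The abstract describes a two-stage workflow: a static analyzer (custom CodeQL rules) pre-screens package files, and only the flagged files are sent to an LLM for a malicious/benign verdict. The minimal Python sketch below illustrates that shape only; the prompt, the placeholder pre-screener, the JSON verdict schema, and the use of the OpenAI chat-completions client are assumptions for illustration, not the paper's actual SocketAI implementation or its CodeQL rules.

```python
# Minimal sketch of a two-stage "pre-screen, then LLM review" pipeline.
# All names, the prompt, and the output schema are illustrative assumptions.
import json
from pathlib import Path

from openai import OpenAI  # assumes the openai>=1.0 Python client and an API key in the environment

client = OpenAI()

PROMPT = (
    "You are a security analyst reviewing a file from an npm package.\n"
    "Decide whether the code is malicious and answer with JSON of the form\n"
    '{"verdict": "malicious" | "benign", "reason": "<one sentence>"}.\n\n'
    "File contents:\n"
)


def prescreen(package_dir: Path) -> list[Path]:
    """Placeholder pre-screener: in the paper this role is played by custom
    CodeQL rules; here we simply return every JavaScript file so the
    two-stage structure of the workflow is visible."""
    return sorted(package_dir.rglob("*.js"))


def review_file(path: Path, model: str = "gpt-4") -> dict:
    """Send one flagged file to the LLM and parse its structured verdict."""
    code = path.read_text(errors="replace")[:8000]  # truncate to fit the context window
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT + code}],
    )
    # A real workflow would validate the response and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)


def review_package(package_dir: Path) -> list[tuple[Path, dict]]:
    """Full workflow: pre-screen, then LLM-review only the flagged files."""
    return [(path, review_file(path)) for path in prescreen(package_dir)]
```

The point of the pre-screening stage is cost control: as the abstract reports, filtering with a static analyzer before invoking the LLM cuts the number of files to analyze by 77.9% and the API costs by 60.9% (GPT-3) and 76.1% (GPT-4).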
