Detecting Security-Relevant Methods using Multi-label Machine Learning (2403.07501v1)
Abstract: To detect security vulnerabilities, static analysis tools need to be configured with security-relevant methods. Current approaches can identify such methods automatically using binary relevance machine learning. However, they ignore dependencies among security-relevant methods, over-generalize, and perform poorly in practice. In addition, users must still manually configure static analysis tools with the detected methods. Based on user feedback and our own observations, these manual steps are often tedious, error-prone, and counter-intuitive. In this paper, we present Dev-Assist, an IntelliJ IDEA plugin that detects security-relevant methods using a multi-label machine learning approach that considers dependencies among labels. The plugin can automatically generate configurations for static analysis tools, run the static analysis, and show the results in IntelliJ IDEA. Our experiments reveal that Dev-Assist's machine learning approach achieves a higher F1-measure than related approaches. Moreover, the plugin reduces and simplifies the manual effort required to configure and use static analysis tools.
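The contrast between binary relevance and a dependency-aware multi-label formulation can be made concrete with a small sketch. This is an illustration only, not Dev-Assist's implementation: the abstract does not say which multi-label method the plugin uses, so the label powerset transformation is swapped in here as one well-known way of keeping label co-occurrence. The method names, the label set, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy training set: method name -> set of labels.
# Neither the methods nor the labels are taken from the paper.
TRAINING = {
    "getParameter": {"source"},
    "executeQuery": {"sink", "CWE-89"},
    "executeUpdate": {"sink", "CWE-89"},
    "escapeSql": {"sanitizer", "CWE-89"},
    "println": set(),
}
LABELS = ["source", "sink", "sanitizer", "CWE-89"]


def binary_relevance(data, labels):
    """Build one independent yes/no target per label.

    Each binary task sees only its own label, so information about
    which labels occur together is discarded.
    """
    return {
        label: {method: label in assigned for method, assigned in data.items()}
        for label in labels
    }


def label_powerset(data):
    """Build one multi-class target whose classes are label combinations.

    A classifier trained on this target can only output combinations
    that appeared in training, so co-occurrence is preserved.
    """
    return {method: frozenset(assigned) for method, assigned in data.items()}


br_targets = binary_relevance(TRAINING, LABELS)
lp_targets = label_powerset(TRAINING)
seen_combinations = Counter(lp_targets.values())

# Binary relevance: the "CWE-89" task is learned without reference to "sink".
print(br_targets["CWE-89"])

# Label powerset: "CWE-89" never appears alone among the training classes.
for combination, count in seen_combinations.items():
    print(sorted(combination), count)
```

In this toy data, "CWE-89" always co-occurs with "sink" or "sanitizer". Independent per-label classifiers could still predict "CWE-89" on its own for a new method, whereas a model over label combinations cannot produce a combination it has never seen. The trade-off is that the number of classes grows with the number of distinct combinations in the training data.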