AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software Artifacts (2403.19072v2)
Abstract: GitGuardian monitored secret exposure in public GitHub repositories and reported that developers leaked over 12 million secrets (database and other credentials) in 2023, a 113% surge from 2021. Despite the availability of secret detection tools, developers ignore the tools' reported warnings because of high false-positive rates (25%-99%). Each secret, however, protects assets of different values, and those assets are accessible through asset identifiers (a DNS name and a public or private IP address). The asset information for a secret can aid developers in filtering false positives and prioritizing secret removal from the source code. Existing secret detection tools do not provide this asset information, so developers must either filter secrets by looking only at the secret value or find the assets manually for each reported secret. The goal of our study is to aid software practitioners in prioritizing secret removal by providing information on the assets protected by the secrets through our novel static analysis tool. We present AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the location of the asset can be distant from where the secret is defined, we investigated secret-asset co-location patterns and found four patterns. To identify the secret-asset pairs of the four patterns, we utilized three approaches (pattern matching, data flow analysis, and fast-approximation heuristics). To evaluate the performance of AssetHarvester, we curated a benchmark of 1,791 secret-asset pairs of four database types extracted from 188 public GitHub repositories. AssetHarvester demonstrates a precision of 97%, a recall of 90%, and an F1-score of 94% in detecting secret-asset pairs. Our findings indicate that the data flow analysis employed in AssetHarvester detects secret-asset pairs with 0% false positives and aids in improving the recall of secret detection tools.
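As an illustration of the pattern-matching approach named in the abstract, the following minimal sketch extracts a secret-asset pair from a URI-style database connection string using named capturing groups. The regular expression, the supported schemes, and the function name `find_secret_asset_pairs` are assumptions made for this example; they are not AssetHarvester's actual patterns or API.

```python
import re

# Hypothetical pattern for URI-style connection strings of the form
# scheme://user:password@host:port/dbname (not AssetHarvester's real regex).
CONN_URI = re.compile(
    r"(?P<scheme>postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://"
    r"(?P<user>[^:/@\s]+):(?P<secret>[^@\s]+)@"
    r"(?P<host>[^:/?\s]+)(?::(?P<port>\d+))?"
)


def find_secret_asset_pairs(text):
    """Return (secret, asset) tuples found in text.

    The asset identifier is the host (DNS name or IP address),
    with the port appended when one is present.
    """
    pairs = []
    for match in CONN_URI.finditer(text):
        asset = match.group("host")
        if match.group("port"):
            asset += ":" + match.group("port")
        pairs.append((match.group("secret"), asset))
    return pairs


if __name__ == "__main__":
    sample = 'DATABASE_URL = "postgresql://app:s3cr3t@db.example.com:5432/orders"'
    for secret, asset in find_secret_asset_pairs(sample):
        print(f"secret={secret!r} protects asset={asset!r}")
```

This sketch only covers the case where the secret and the asset sit in the same string. Pairs whose asset is defined far from the secret are the cases the abstract assigns to data flow analysis and fast-approximation heuristics.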