PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages (2404.16565v1)
Abstract: A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often lacks its source code repository information because the package's development platform is separate from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may contain no repository information, or it may contain wrong information. Our analysis shows that existing tools can retrieve repository information for at most 70.5% of PyPI releases. To address these limitations, this paper proposes PyRadar, a novel framework that uses both the metadata and the source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study that compares four existing tools on 4,227,425 PyPI releases and analyzes phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components: the Metadata-based Retriever, the Source Code Repository Validator, and the Source Code-based Retriever. The Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms to six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
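To make the hashing step and the phantom-file notion concrete, here is a minimal sketch, not the paper's implementation. It assumes the SHA-1 hashes are Git blob hashes (the content prefixed with a `blob <size>\0` header) and that a file counts as phantom when its content hash is absent from the repository's set of blob hashes. Both are assumptions, as are the helper names `git_blob_sha1`, `python_file_hashes`, and `phantom_files`. The World of Code lookup itself is left out because its query interface is not described here.

```python
import hashlib
import io
import tarfile


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob holding this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes (assumed hash scheme).
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def python_file_hashes(sdist: tarfile.TarFile) -> dict[str, str]:
    """Map every regular .py member of an opened sdist archive to its blob hash."""
    hashes = {}
    for member in sdist.getmembers():
        if member.isfile() and member.name.endswith(".py"):
            hashes[member.name] = git_blob_sha1(sdist.extractfile(member).read())
    return hashes


def phantom_files(sdist_hashes: dict[str, str], repo_blob_hashes: set[str]) -> list[str]:
    """List distribution files whose content hash is absent from the repository.

    Matching by content hash rather than by path is an assumption of this sketch.
    """
    return sorted(
        name for name, digest in sdist_hashes.items()
        if digest not in repo_blob_hashes
    )


if __name__ == "__main__":
    # Build a tiny in-memory sdist with one file that the "repository" lacks.
    files = {"pkg-1.0/pkg/__init__.py": b"", "pkg-1.0/pkg/generated.py": b"x = 1\n"}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
        hashes = python_file_hashes(archive)
    repo_hashes = {git_blob_sha1(b"")}  # hypothetical repository blob set
    print(phantom_files(hashes, repo_hashes))
```

In this sketch the per-file hashes serve two purposes: they are the keys one would send to a blob-to-repository index, and comparing them against a candidate repository's blobs yields the phantom files that the validation step could use as a signal.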