PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages (2404.16565v1)

Published 25 Apr 2024 in cs.SE

Abstract: A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often lacks its source code repository information because the package's development platform is separate from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may not contain the repository information, or it may contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address these limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI releases and analyze phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever. In particular, the Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
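
To illustrate the hashing step the abstract describes, the following is a minimal sketch, not PyRadar's actual implementation. It assumes the SHA-1 hashes are Git blob hashes (SHA-1 over a `blob <size>\0` header plus the file content) and that the source distribution is a tar archive. The function names `git_blob_sha1`, `python_file_hashes`, and `phantom_files` are hypothetical, and comparing files by content hash is an assumed reading of the abstract's phantom-file definition. No World of Code query is shown, since its interface is not described here.

```python
import hashlib
import io
import tarfile
from typing import BinaryIO, Dict, List, Set


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def python_file_hashes(sdist: BinaryIO) -> Dict[str, str]:
    """Map each .py file in a tar-format source distribution to its blob hash."""
    hashes: Dict[str, str] = {}
    with tarfile.open(fileobj=sdist, mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".py"):
                data = archive.extractfile(member).read()
                hashes[member.name] = git_blob_sha1(data)
    return hashes


def phantom_files(dist_hashes: Dict[str, str], repo_blobs: Set[str]) -> List[str]:
    """List distribution files whose content hash is absent from the repository.

    Assumption: a file counts as phantom when no blob in the candidate
    repository has the same hash.
    """
    return sorted(path for path, h in dist_hashes.items() if h not in repo_blobs)
```

A caller would open a downloaded `.tar.gz` in binary mode, pass it to `python_file_hashes`, and look up the resulting hashes in whatever index of repository blobs is available.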
