PeaTMOSS: Mining Pre-Trained Models in Open-Source Software (2310.03620v1)
Abstract: Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos
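The three parts of the dataset, and in particular the PTM-to-repository mapping in part (3), can be illustrated with a minimal relational sketch. The table and column names below are hypothetical, chosen only to show how such a mapping joins PTMs to the projects that use them; they are not PeaTMOSS's actual schema.

```python
import sqlite3

# Toy in-memory database mirroring the dataset's three parts:
# (1) PTMs, (2) repositories, (3) a mapping between them.
# Schema and values are illustrative, not taken from PeaTMOSS itself.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ptm      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE repo     (id INTEGER PRIMARY KEY, url  TEXT);
CREATE TABLE ptm_repo (ptm_id INTEGER, repo_id INTEGER);
INSERT INTO ptm      VALUES (1, 'bert-base-uncased');
INSERT INTO repo     VALUES (1, 'https://github.com/example/project');
INSERT INTO ptm_repo VALUES (1, 1);
""")

# Join the mapping table to answer: which repositories use which PTMs?
rows = con.execute("""
    SELECT ptm.name, repo.url
    FROM ptm_repo
    JOIN ptm  ON ptm.id  = ptm_repo.ptm_id
    JOIN repo ON repo.id = ptm_repo.repo_id
""").fetchall()
print(rows)  # [('bert-base-uncased', 'https://github.com/example/project')]
```

A miner would run analogous joins at scale, e.g. to count downstream projects per PTM or to group repositories by the model families they depend on.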