
PeaTMOSS: Mining Pre-Trained Models in Open-Source Software (2310.03620v1)

Published 5 Oct 2023 in cs.SE and cs.AI

Abstract: Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
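As a rough illustration of how the dataset's three-part structure (PTMs, downstream repositories, and the mapping between them) could be explored, the sketch below joins models to the repositories that reuse them. It assumes the metadata is available as an SQLite database; the file name (peatmoss.db) and the table and column names (model, repository, model_to_repository, and so on) are hypothetical placeholders rather than PeaTMOSS's documented schema, so consult the demo repository for the actual interface.

```python
import sqlite3

# Minimal sketch: list which repositories reuse which PTMs.
# NOTE: the database path and all table/column names below are
# hypothetical placeholders, not PeaTMOSS's documented schema.
conn = sqlite3.connect("peatmoss.db")

query = """
SELECT m.name AS ptm_name,
       r.url  AS repo_url
FROM   model               AS m
JOIN   model_to_repository AS mr ON mr.model_id = m.id
JOIN   repository          AS r  ON r.id = mr.repository_id
LIMIT  10;
"""

for ptm_name, repo_url in conn.execute(query):
    print(f"{ptm_name} is reused by {repo_url}")

conn.close()
```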

