Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset (2405.13185v1)
Abstract: Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large ML models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help selecting the appropriate model for their current task. To tackle this issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE, i.e., the existing tags are more suited to generic ML categories. This paper introduces an approach to address this gap by enabling the automatic classification of PTMs for SE tasks. First, we utilize a public dump of HF to extract PTM information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from the existing literature. The approach involves creating an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs by their pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
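The similarity-based matching between HF tags (and model names) and SE tasks described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the task-keyword map, the edit-based similarity via `difflib`, and the 0.8 threshold are all assumptions; the paper derives its tag-to-task mapping semi-automatically from the literature.

```python
from difflib import SequenceMatcher

# Hypothetical SE-task keyword map (illustrative only; the paper builds
# its mapping semi-automatically from the SE literature).
SE_TASK_KEYWORDS = {
    "code generation": ["text-to-code", "code-generation", "codegen"],
    "code summarization": ["code-to-text", "code-summarization"],
    "defect prediction": ["defect", "bug-prediction", "vulnerability"],
}

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_model(tags, model_name, threshold=0.8):
    """Return SE tasks whose keywords are similar to the model's
    HF tags or to tokens of its name (assumed matching strategy)."""
    # Candidate strings: the HF tags plus the hyphen-split model name.
    candidates = list(tags) + model_name.lower().replace("/", "-").split("-")
    matched = set()
    for task, keywords in SE_TASK_KEYWORDS.items():
        for kw in keywords:
            if any(similarity(kw, c) >= threshold for c in candidates):
                matched.add(task)
    return sorted(matched)

print(classify_model(["text-to-code", "pytorch"], "Salesforce/codegen-350M-mono"))
# matches "code generation" via both the tag and the name token "codegen"
```

A real pipeline would read the tags and model cards from the public HF dump rather than passing them in by hand, but the matching step reduces to this kind of thresholded string comparison.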