Ecosystem of Large Language Models for Code (2405.16746v2)
Abstract: The availability of vast amounts of publicly accessible source code data and advances in modern LLMs, coupled with increasing computational resources, have led to a remarkable surge in the development of LLMs for code (LLM4Code, for short). The interaction between code datasets and models gives rise to a complex ecosystem characterized by intricate dependencies that are worth studying. This paper presents a pioneering analysis of the code model ecosystem. Using Hugging Face -- the premier hub for transformer-based models -- as our primary source, we curate a list of datasets and models that are manually confirmed to be relevant to software engineering. By analyzing the ecosystem, we first identify popular and influential datasets, models, and contributors, quantifying popularity with metrics such as the number of downloads, the number of likes, and the number of reuses. The ecosystem follows a power-law distribution, indicating that users prefer a small set of widely recognized models and datasets. We then manually categorize how models in the ecosystem are reused into nine categories and analyze prevalent model reuse practices; the three most popular reuse types are fine-tuning, architecture sharing, and quantization. We also explore the practices surrounding the publication of LLM4Code, focusing on documentation practice and license selection. We find that documentation in the ecosystem contains less information than that of general AI-related repositories hosted on GitHub. License usage also differs from other software repositories: models in the ecosystem adopt AI-specific licenses, e.g., RAIL (Responsible AI Licenses) and the AI model license agreement.
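The power-law claim above can be checked with a simple rank-frequency fit: sort items by a popularity metric (e.g., downloads), then regress log(count) on log(rank); a roughly linear log-log relationship with negative slope indicates a power law. The sketch below uses illustrative download counts (not real Hugging Face statistics) generated to follow roughly count ∝ rank^-2:

```python
import math

# Hypothetical download counts for models ranked by popularity
# (illustrative numbers only, roughly following count = 120000 / rank^2).
downloads = [120000, 30000, 13300, 7500, 4800, 3300, 2450, 1875, 1480, 1200]

def powerlaw_slope(counts):
    """Estimate the exponent a of a rank-frequency power law
    count = C * rank^a via ordinary least squares on log-log data."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# A slope near -2 on synthetic rank^-2 data confirms the fit works;
# on real ecosystem data, the paper's claim predicts a similar linear trend.
slope = powerlaw_slope(downloads)
print(f"estimated exponent: {slope:.2f}")
```

In practice, the ranked counts would come from the Hugging Face Hub metadata rather than a hard-coded list; the regression step is unchanged.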