LEGION: Harnessing Pre-trained Language Models for GitHub Topic Recommendations with Distribution-Balance Loss (2403.05873v1)
Abstract: Open-source development has revolutionized the software industry by promoting collaboration, transparency, and community-driven innovation. Today, a vast number of open-source software projects, forming networks of repositories, are hosted on GitHub, a popular software development platform. To enhance the discoverability of these repository networks, i.e., groups of similar repositories, GitHub introduced repository topics in 2017, enabling users to more easily explore relevant projects by type, technology, and more. It is thus crucial to accurately assign topics to each GitHub repository. Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data, which struggles to capture semantic nuances. This paper addresses the limitations of existing techniques by proposing Legion, a novel approach that leverages Pre-trained Language Models (PTMs) to recommend topics for GitHub repositories. The key novelty of Legion is three-fold. First, Legion leverages the extensive language-understanding capabilities of PTMs to capture contextual information and semantic meaning in GitHub repositories. Second, Legion overcomes the challenge of long-tailed topic distributions, which bias PTMs toward popular topics, by using a Distribution-Balanced Loss (DB Loss) to better train the PTMs. Third, Legion employs a filter to eliminate vague recommendations, thereby improving the precision of PTMs. Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion improves vanilla PTMs by up to 26% in recommending GitHub topics. Legion also suggests GitHub topics more precisely and effectively than the state-of-the-art baseline, with average improvements of 20% in Precision and 5% in F1-score.
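The abstract only names the Distribution-Balanced Loss without detailing it, so a minimal PyTorch sketch may help clarify the idea: each instance-topic pair receives a re-balancing weight that boosts rare topics, and negative logits are penalized more tolerantly so that the many absent head-class labels do not dominate training. The class names, hyperparameter values, and the simplified negative-tolerant term below are illustrative assumptions in the spirit of the DB Loss literature, not Legion's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributionBalancedLoss(nn.Module):
    """Minimal sketch of a Distribution-Balanced Loss for multi-label
    topic classification on a long-tailed label distribution.
    Hyperparameter values are illustrative, not Legion's settings."""

    def __init__(self, class_freq, alpha=0.1, beta=10.0, mu=0.3, lam=5.0):
        super().__init__()
        # class_freq[k]: number of training repositories tagged with topic k
        self.register_buffer("inv_freq", 1.0 / class_freq.float())
        self.alpha, self.beta, self.mu = alpha, beta, mu
        self.lam = lam  # scale that softens the penalty on negative labels

    def forward(self, logits, targets):
        # logits, targets: (batch, num_topics); targets is multi-hot
        # Re-balanced weighting: class-level sampling probability
        # divided by the instance-level one, then smoothed.
        p_class = self.inv_freq.unsqueeze(0)                      # (1, C)
        p_inst = (targets * self.inv_freq).sum(1, keepdim=True)   # (B, 1)
        r = p_class / p_inst.clamp(min=1e-12)
        r_hat = self.alpha + torch.sigmoid(self.beta * (r - self.mu))
        # Negative-tolerant binary cross-entropy:
        # softplus(-z) = -log sigmoid(z); negatives are down-scaled by lam.
        pos = F.softplus(-logits)
        neg = F.softplus(self.lam * logits) / self.lam
        return (r_hat * (targets * pos + (1 - targets) * neg)).mean()

# Usage sketch with hypothetical long-tailed topic frequencies.
freq = torch.tensor([5000, 1200, 300, 45, 9, 2])   # 6 example topics
criterion = DistributionBalancedLoss(freq)
logits = torch.randn(4, 6)                          # PTM classifier outputs
labels = torch.randint(0, 2, (4, 6)).float()        # multi-hot ground truth
loss = criterion(logits, labels)
```

In a Legion-like pipeline, this loss would sit on top of a PTM encoder (e.g., BERT) with one sigmoid output per candidate topic, and a confidence threshold over those outputs would serve as the filtering step the abstract describes.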