Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning (2405.04126v1)
Abstract: The latest developments in NLP have demonstrated remarkable progress on the code-text retrieval problem. As the Transformer-based models used in this task continue to grow in size, the computational cost and time required for end-to-end fine-tuning become substantial. This poses a significant challenge for adapting and utilizing these models when computational resources are limited. Motivated by these concerns, we propose a fine-tuning framework that leverages Parameter-Efficient Fine-Tuning (PEFT) techniques. Moreover, we adopt contrastive learning objectives to improve the quality of the bimodal representations learned by Transformer models. Additionally, we provide extensive benchmarking of PEFT methods, the lack of which has been highlighted as a crucial gap in the literature. Based on thorough experimentation with the CodeT5+ model on two datasets, we demonstrate that the proposed fine-tuning framework can improve code-text retrieval performance while tuning at most 0.4% of the model's parameters.
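Below is a minimal sketch of the general recipe the abstract describes: freezing a pretrained bi-encoder, injecting LoRA adapters (one possible PEFT method), and training them with a symmetric InfoNCE contrastive objective over paired text-code batches. The checkpoint name, target modules, pooling scheme, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: LoRA-based PEFT with a contrastive text-code retrieval objective.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model

MODEL_NAME = "microsoft/codebert-base"  # assumed backbone for illustration; the paper uses CodeT5+

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base = AutoModel.from_pretrained(MODEL_NAME)

# Freeze the backbone and train only low-rank adapters in the attention projections,
# so the number of updated parameters stays at a fraction of a percent of the model.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["query", "value"])
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

def embed(texts):
    """Mean-pool token states into one L2-normalised vector per input."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def contrastive_loss(text_emb, code_emb, temperature=0.05):
    """Symmetric InfoNCE: matched (text, code) pairs are positives, other in-batch pairs negatives."""
    logits = text_emb @ code_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

queries = ["sort a list in descending order",
           "read a file into a string"]
snippets = ["def sort_desc(xs):\n    return sorted(xs, reverse=True)",
            "def read_file(path):\n    return open(path).read()"]
loss = contrastive_loss(embed(queries), embed(snippets))
loss.backward()  # gradients reach only the LoRA adapter parameters
```

In this sketch the adapter weights would then be updated with any standard optimizer; swapping LoRA for another PEFT method (e.g., prompt tuning or (IA)^3) only changes the configuration passed to the PEFT wrapper, not the retrieval objective.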