Are Decoder-Only Large Language Models the Silver Bullet for Code Search? (2410.22240v1)
Abstract: Code search is crucial for code reuse, enabling developers to efficiently locate relevant snippets. Current methods rely on encoder-based models, which suffer from limitations such as poor generalization and restricted input lengths. Decoder-only LLMs, with their extensive pre-training, larger size, and longer input capabilities, offer potential solutions to these issues, yet their effectiveness in code search remains underexplored. To fill this gap, our study presents the first systematic exploration of decoder-only LLMs for code search. We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets (CSN and CoSQA+), and three model sizes. Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder, achieving a 5.57% improvement in MRR on CSN and a 49.6% increase in MAP on CoSQA+ compared to zero-shot UniXcoder. These results highlight the superior performance and adaptability of decoder-only models. Additionally, we provide valuable insights into optimizing these models for code search, covering aspects such as model selection, fine-tuning methods, training data, and model size, and discussing their strengths and limitations.
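To make the setup concrete, below is a minimal sketch of how a decoder-only LLM can serve as a bi-encoder for code search and be scored with MRR, in the spirit of the evaluation described in the abstract. This is not the paper's exact pipeline: the checkpoint name, last-token pooling, and in-batch ranking over paired query/code lists are illustrative assumptions.

```python
# Sketch (assumptions, not the paper's exact method): embed queries and code
# snippets with a decoder-only LLM via last-token pooling, rank candidates by
# cosine similarity, and report MRR over the ranked lists.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "google/codegemma-7b-it"  # illustrative; any decoder-only checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:          # some decoder-only tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Embed texts by pooling the final hidden state of the last non-padding token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1          # index of last real token per sequence
    emb = hidden[torch.arange(hidden.size(0)), last]       # (B, H)
    return torch.nn.functional.normalize(emb, dim=-1)


def mean_reciprocal_rank(queries: list[str], codes: list[str]) -> float:
    """MRR when the ground-truth snippet for queries[i] is codes[i]."""
    q, c = embed(queries), embed(codes)
    sims = q @ c.T                                          # cosine similarities (B, B)
    # 1-based rank of the gold snippet: count candidates scoring at least as high.
    ranks = (sims >= sims.diag().unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()
```

Fine-tuning in the study further adapts such embeddings (e.g., contrastively on query-code pairs); the sketch above only illustrates the zero-shot retrieval and scoring loop.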