
Are Decoder-Only Large Language Models the Silver Bullet for Code Search? (2410.22240v1)

Published 29 Oct 2024 in cs.SE

Abstract: Code search is crucial for code reuse, enabling developers to efficiently locate relevant snippets. Current methods rely on encoder-based models, which suffer from limitations such as poor generalization and restricted input lengths. Decoder-only LLMs, with their extensive pre-training, larger size, and longer input capabilities, offer potential solutions to these issues, yet their effectiveness in code search remains underexplored. To fill this gap, our study presents the first systematic exploration of decoder-only LLMs for code search. We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets (CSN and CoSQA+), and three model sizes. Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder, achieving a 5.57% improvement in MRR on CSN and a 49.6% increase in MAP on CoSQA+ compared to zero-shot UniXcoder. These results highlight the superior performance and adaptability of decoder-only models. Additionally, we provide valuable insights into optimizing these models for code search, covering aspects such as model selection, fine-tuning methods, training data, and model size, and discussing their strengths and limitations.


Summary

  • The paper finds that decoder-only LLMs require task-specific fine-tuning to outperform encoder-based models in code search.
  • Fine-tuning yields notable gains, with models like CodeGemma improving MRR on CSN and MAP on CoSQA+ (a brief sketch of both metrics follows this list).
  • The study also highlights that decoder-only architectures can exploit longer input sequences, while underscoring the importance of specialized training techniques.
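For context on the reported numbers, the sketch below shows how Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) are conventionally computed over ranked retrieval results. The ranked lists here are hypothetical and this is not the authors' evaluation code.

```python
from typing import List

def mean_reciprocal_rank(ranked_relevance: List[List[int]]) -> float:
    """MRR over queries; each inner list marks relevance (1/0) of results by rank."""
    total = 0.0
    for ranking in ranked_relevance:
        for i, rel in enumerate(ranking, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def mean_average_precision(ranked_relevance: List[List[int]]) -> float:
    """MAP over queries; average precision is averaged over the relevant hits."""
    ap_sum = 0.0
    for ranking in ranked_relevance:
        hits, precisions = 0, []
        for i, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)
        ap_sum += sum(precisions) / max(hits, 1)
    return ap_sum / len(ranked_relevance)

# Hypothetical rankings for three queries (1 = relevant snippet at that rank).
rankings = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(mean_reciprocal_rank(rankings))    # (1/2 + 1 + 1/3) / 3 ≈ 0.611
print(mean_average_precision(rankings))  # same value here: each query has a single relevant hit
```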

Overview of the Paper "Are Decoder-Only LLMs the Silver Bullet for Code Search?"

The paper focuses on evaluating the effectiveness of decoder-only LLMs for code search tasks, particularly in comparison to the traditionally used encoder-based models. The researchers investigate whether the pre-training and architectural advantages of decoder-only models, which allow for longer input processing and potentially better generalization, can lead to significant improvements in code search, a critical task for developers aiming to reuse code efficiently.

Contributions and Methodology

The authors conduct a systematic exploration involving nine prominent decoder-only LLMs. Their investigation centers on three main questions: the performance of these models in zero-shot settings, improvements from fine-tuning, and underlying reasons for performance variations.

  1. Zero-shot Performance: The paper begins by comparing the zero-shot capabilities of decoder-only models with encoder-only models such as UniXcoder. Despite their large size and sophisticated pre-training, decoder-only models initially underperform their encoder-based counterparts in zero-shot code search tasks. This finding underscores that task-specific pre-training is vital for achieving adequate zero-shot performance.
  2. Fine-tuning Improvements: Through fine-tuning, decoder-only models achieve substantial gains in performance across benchmark datasets like CSN and CoSQA+. Fine-tuned decoder-only models, particularly CodeGemma, demonstrate superior adaptability, achieving a 5.57% improvement in Mean Reciprocal Rank (MRR) on the CSN dataset over UniXcoder. This improvement is even more pronounced on the CoSQA+ dataset, highlighting the stronger generalization ability of decoder-only models when fine-tuned properly.
  3. Analysis of Improvements: The authors explore why fine-tuning enhances performance. The paper contrasts supervised contrastive learning with unsupervised methods, showing that the supervised approach produces better-structured embeddings and higher retrieval accuracy (a minimal retrieval and contrastive-loss sketch follows this list). Additionally, they assess the role of dataset specificity and model size, finding that both the architecture and comprehensive, task-specific training data significantly influence effectiveness.
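To make the retrieval setup concrete, here is a minimal bi-encoder sketch: a decoder-only causal LM embeds queries and code snippets by last-token pooling, candidates are ranked by cosine similarity, and an InfoNCE-style supervised contrastive loss with in-batch negatives is the kind of objective used for fine-tuning. The checkpoint name and the pooling choice are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any decoder-only model exposing hidden states works similarly.
MODEL_NAME = "google/codegemma-7b-it"  # assumption: swap in a smaller model for quick tests

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
if tokenizer.pad_token is None:            # decoder-only tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

@torch.no_grad()
def embed(texts):
    """Last-token pooling: use the hidden state of each sequence's final real token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    hidden = model(**batch).last_hidden_state                     # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1                 # index of last non-pad token
    rows = torch.arange(hidden.size(0), device=hidden.device)
    return F.normalize(hidden[rows, last], dim=-1)                # unit-norm embeddings (B, H)

# Retrieval: rank code candidates by cosine similarity to the query embedding.
query_emb = embed(["how to read a json file in python"])
code_emb = embed([
    "def load(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
])
scores = query_emb @ code_emb.T            # cosine similarity, since embeddings are normalized
print(scores.argsort(descending=True))     # candidate indices, best match first

def info_nce(query_emb, code_emb, temperature=0.05):
    """Supervised contrastive loss with in-batch negatives:
    the i-th code snippet is the positive for the i-th query."""
    logits = (query_emb @ code_emb.T) / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```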

Results and Implications

The findings have several implications. First, decoder-only models, given sufficient task-specific fine-tuning, can outperform traditional encoder-based models, validating their potential for code search. This potential is not inherent, however: it depends on tailored training methods and data.

The research also presents evidence that the architecture of decoder-only LLMs provides advantages in handling longer queries and diverse datasets. However, challenges persist with ultra-short queries due to the curse of dimensionality and lack of contextual clarity. These limitations suggest that ongoing efforts to optimize model architectures for specific input types remain necessary.

Future Directions

This paper invites numerous future research avenues. One area of interest lies in improving zero-shot performance, potentially through hybrid architectures that blend decoder and encoder properties. Additionally, the findings encourage further exploration of specialized fine-tuning datasets and training strategies to alleviate the limitations associated with short inputs.

Overall, the paper showcases the versatility and potential decoder-only LLMs hold for code search, while also highlighting the need for continued innovation in model training and fine-tuning to leverage these strengths fully. As the capabilities and applications of LLMs expand, decoder-only architectures are poised to play a critical role in advancing software engineering workflows, facilitating more efficient code reuse and improving developer productivity.
