Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search (2401.04514v2)
Abstract: In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly given the demonstrated code generation capabilities of LLMs. Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation may stem from the fact that the generated code, albeit functionally accurate, frequently displays a pronounced stylistic deviation from the ground-truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances. The source code and data are available at \url{https://github.com/Alex-HaochenLi/ReCo}.
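The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two LLM calls (exemplar generation for the query, style-normalizing rewrites of the codebase) are stubbed with placeholder functions, and token-level Jaccard similarity stands in for a real sparse retriever such as BM25.

```python
# Sketch of GAR + ReCo retrieval. LLM steps are stubbed; a real system
# would prompt an actual model for both generation and rewriting.

def generate_exemplar(query: str) -> str:
    """GAR step (stub): an LLM generates exemplar code for the query."""
    return f"def {query.replace(' ', '_')}(): pass"

def rewrite_code(code: str) -> str:
    """ReCo step (stub): an LLM rewrites codebase code, normalizing its
    style toward the style of LLM-generated code."""
    return code.strip().lower()

def tokenize(text: str) -> set:
    # Crude code tokenizer: split on whitespace after padding punctuation.
    for ch in "():,":
        text = text.replace(ch, " ")
    return set(text.split())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def search(query: str, codebase: list[str]) -> list[tuple[float, str]]:
    """Rank codebase snippets by similarity between the generated exemplar
    and each style-normalized (rewritten) snippet."""
    exemplar = tokenize(generate_exemplar(query))
    scored = [(jaccard(exemplar, tokenize(rewrite_code(c))), c) for c in codebase]
    return sorted(scored, reverse=True)
```

The key point ReCo adds over plain GAR is that similarity is computed against `rewrite_code(c)` rather than the raw snippet `c`, so both sides of the comparison share the LLM's code style.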
- Haochen Li
- Xin Zhou
- Zhiqi Shen