Self-Supervised Query Reformulation for Code Search (2307.00267v1)
Abstract: Automatic query reformulation is a widely used technique for enriching user requirements and improving the outcomes of code search. It can be conceptualized as a machine translation task, wherein the objective is to rephrase a given query into a more comprehensive alternative. While such models show promising results, training them typically requires a large parallel corpus of query pairs (i.e., an original query and its reformulation), which online code search engines keep confidential and unpublished. This restricts their practicality in software development processes. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on an extensive unannotated corpus of queries. SSQR extends T5 (a Transformer-based sequence-to-sequence model) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Subsequently, for a given query to be reformulated, SSQR identifies potential locations for expansion and leverages the pre-trained T5 model to generate appropriate content to fill these gaps. The expansions are then selected based on the information gain associated with each candidate. Evaluation results demonstrate that SSQR significantly outperforms unsupervised baselines and achieves competitive performance compared to supervised methods.
- Yuetian Mao
- Chengcheng Wan
- Yuze Jiang
- Xiaodong Gu