When Deep Learning Met Code Search (1905.03813v4)

Published 9 May 2019 in cs.SE, cs.CL, and cs.LG

Abstract: There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of $\mathit{embedding}$ code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including $\mathit{unsupervised}$ techniques, which rely only on a corpus of code examples, and $\mathit{supervised}$ techniques, which use an $\mathit{aligned}$ corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a $\mathit{minimal}$ supervision extension to an existing unsupervised technique. Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus. The evaluation dataset is now available at arXiv:1908.09804

Citations (210)

Summary

  • The paper demonstrates that simple supervised models, like UNIF, can outperform more complex neural architectures in code search tasks.
  • It systematically compares various embedding techniques, revealing that training data quality plays a crucial role in performance.
  • The study highlights that leveraging fastText embeddings and attention mechanisms effectively aligns code snippets with natural language queries.

An Evaluation of Neural Code Search Techniques

The integration of deep learning with code search represents a notable research direction aiming to enhance developers' efficiency by retrieving code snippets based on natural language queries. The paper "When Deep Learning Met Code Search" addresses this domain by evaluating various neural network models that embed code and natural language into a joint vector space, using vector distance to approximate semantic similarity. This essay discusses the research presented in the paper, focusing on its methodologies, results, and findings regarding the effectiveness of different neural network architectures in the context of code search.

Overview and Methodology

The paper systematically evaluates the performance of supervised and unsupervised neural networks for code search, implementing several state-of-the-art techniques on a unified platform. The unsupervised technique, Neural Code Search (NCS), uses fastText embeddings combined with classical TF-IDF weights to represent both code and queries. In contrast, the supervised techniques, namely UNIF, CODEnn, and Semantic Code Search (SCS), utilize different neural architectures to refine embeddings based on training corpora containing aligned code and natural language pairs.

  1. NCS: As a baseline, NCS uses an unsupervised embedding approach with no conventional deep network. Code and query embeddings are built from fastText word vectors, pooled as a TF-IDF-weighted bag of words (see the pooling sketch after this list).
  2. UNIF: This technique adds minimal supervision to NCS. It keeps the bag-of-words representation but replaces the fixed TF-IDF weights on the code side with learned attention weights, and refines the embedding matrix through supervised training on aligned code and description pairs (also illustrated in the sketch after this list).
  3. CODEnn: This approach leverages bi-directional LSTMs to encode sequences of code and query tokens, capturing finer-grained ordering information. Its architecture is more complex, constructing embeddings from method-name sequences, API sequences, and natural language docstrings (a minimal sequence-encoder sketch appears below).
  4. SCS: Introduced by GitHub, SCS uses GRU- and LSTM-based sequence models to learn a transformation from code sequences to docstrings, emphasizing the prediction of query embeddings from code.
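
To make the contrast between NCS and UNIF concrete, the following is a minimal sketch of the two pooling schemes: a TF-IDF-weighted average of pre-trained token embeddings for NCS, and a learned attention-weighted sum for UNIF. The toy vocabulary, the random stand-in for fastText vectors, and names such as `embed_lookup` and `attention` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

DIM = 100
rng = np.random.default_rng(0)
vocab = ["open", "read", "file", "lines", "close", "parse", "json"]
# Stand-in for fastText vectors; in NCS these come from unsupervised
# training on a large code corpus.
embed_lookup = {tok: rng.normal(size=DIM) for tok in vocab}

def ncs_embed(tokens, idf):
    """NCS-style pooling: TF-IDF-weighted average of token embeddings."""
    present = [t for t in tokens if t in embed_lookup]
    vecs = [tokens.count(t) * idf.get(t, 1.0) * embed_lookup[t] for t in set(present)]
    return np.mean(vecs, axis=0)

def unif_embed(tokens, attention):
    """UNIF-style pooling: a learned attention vector scores each token
    embedding; the pooled embedding is the softmax-weighted sum."""
    embs = np.stack([embed_lookup[t] for t in tokens if t in embed_lookup])
    scores = embs @ attention                 # one score per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over tokens
    return weights @ embs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy retrieval: rank two snippets against a query by cosine similarity.
idf = {t: 1.0 for t in vocab}            # placeholder IDF table
attention = rng.normal(size=DIM)         # in UNIF this vector is learned
query = ["read", "file", "lines"]
snippets = {
    "read_lines": ["open", "read", "lines", "close"],
    "parse_json": ["parse", "json"],
}
q_vec = ncs_embed(query, idf)            # query side: plain (weighted) average
ncs_rank = sorted(snippets, key=lambda n: -cosine(q_vec, ncs_embed(snippets[n], idf)))
unif_rank = sorted(snippets, key=lambda n: -cosine(q_vec, unif_embed(snippets[n], attention)))
print(ncs_rank, unif_rank)   # snippets ordered from most to least similar
```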

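For the sequence-based models (CODEnn and SCS), the sketch below shows a bidirectional-LSTM encoder in the spirit of CODEnn that turns a token sequence into a single fixed-size embedding by max-pooling over time. The dimensions, vocabulary size, and the class name `SeqEncoder` are assumptions for illustration; the actual models combine several such encoders over method names, API sequences, tokens, and docstrings.

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """Bidirectional LSTM over token ids, max-pooled into one embedding."""
    def __init__(self, vocab_size=1000, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2 * hidden)
        return out.max(dim=1).values              # max-pool over the sequence

encoder = SeqEncoder()
code_ids = torch.randint(0, 1000, (4, 20))        # batch of 4 token sequences
print(encoder(code_ids).shape)                    # torch.Size([4, 400])
```
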
The authors assembled diverse datasets to train and evaluate these models, including Java-specific and Android-specific corpora, and assessed them on established benchmark query sets such as Java-50 and Android-287. Performance was quantified by the number of queries answered within the top-k results (for several values of k) and by the mean reciprocal rank (MRR).
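
As a rough illustration of these two measures, the snippet below computes answered@k and MRR from a list of per-query ranks; the rank values are made up.

```python
# `ranks` holds, for each benchmark query, the 1-based rank of the first
# correct snippet, or None if no relevant result was retrieved (toy data).
ranks = [1, 3, None, 2, 1, None, 7]

def answered_at(ranks, k):
    """Number of queries whose first correct result appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries, counting unanswered queries as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

print(answered_at(ranks, 1), answered_at(ranks, 5))   # 2 4
print(round(mean_reciprocal_rank(ranks), 3))          # 0.425
```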

Results and Analysis

The authors observed:

  • Supervised vs. Unsupervised: Supervised methods tend to outperform unsupervised ones (e.g., UNIF surpassed unsupervised NCS) when appropriately matched training data is available. However, gains were not uniform, highlighting the influence of training data characteristics.
  • Model Complexity: Surprisingly, the simpler UNIF model outperformed the more sophisticated CODEnn and SCS on the benchmarks used, suggesting that complexity doesn't necessarily correlate with better performance in this context. The findings emphasize the need to consider simpler models before employing complex architectures.
  • Quality of Training Data: The choice of natural language descriptions used for supervision significantly impacts performance. For instance, leveraging more relevant training corpora containing natural language closely aligned with expected user queries improved the results substantially for all models tested.

Implications and Future Prospects

From a theoretical perspective, the paper demonstrates that embedding learning can be effectively applied to code search tasks, but emphasizes the critical role of training data quality over model sophistication. Practically, these findings inform developers and researchers about the trade-offs between model complexity and training data characteristics, guiding efficient design and deployment of neural code search systems.

The results point toward future opportunities to explore models that can dynamically adapt to varying qualities of input data and potentially hybrid systems that combine the strengths of multiple architectures. Additionally, a deeper examination of embedding alignment techniques could further enhance the semantic congruence between code snippets and natural language queries.

In conclusion, the paper provides valuable insights into neural code search, advocating for balanced design considerations that prioritize training data alignment and model simplicity for optimal performance. The implications extend to general embeddings research where applicability to practical tasks must account for the variability in data scenarios and search requirements.