Retrieval Augmented Code Generation and Summarization (2108.11601v2)

Published 26 Aug 2021 in cs.SE and cs.CL

Abstract: Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers' code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

Authors (5)

Md Rizwan Parvez (24 papers)
Wasi Uddin Ahmad (41 papers)
Saikat Chakraborty (62 papers)
Baishakhi Ray (88 papers)
Kai-Wei Chang (292 papers)

Summary

Overview of Retrieval Augmented Code Generation and Summarization

The paper "Retrieval Augmented Code Generation and Summarization" presents REDCODER, a framework that aims to improve code generation and summarization by mimicking how developers work: they often reuse previously written code and documentation when implementing or describing new functionality. REDCODER takes a retrieval-augmented approach, retrieving relevant source code or summaries from a database and supplying them as supplementary input to code generation and summarization models.

The paper notes that source code generation and summarization are difficult because they involve diverse token sequences and require an understanding of programming languages at the lexical, syntactic, and semantic levels. Existing learning-based approaches, which leverage high-quality human-written code and text from open-source repositories, have shown promise, yet significant limitations persist in handling long source code sequences.

Key Contributions

Dense Retrieval Technique: REDCODER extends a state-of-the-art dense retrieval technique to fetch relevant code or summaries. This contrasts with sparse retrieval models such as TF-IDF or BM25, which struggle to capture the intricate syntactic and semantic structure of code and natural language descriptions.
Flexible Database Utilization: The framework is capable of working with databases containing either unimodal or bimodal instances, making it versatile for different code and summary generation contexts.
Modular Framework: REDCODER is a two-step process, retrieval followed by generation. Different retriever and generator models can be plugged in, so the framework remains model-agnostic.
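
The retrieve-then-generate flow described in the contributions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `retrieve_code`, `augment`, `toy_encode`, the `<sep>` separator, and the hand-written two-dimensional vectors are all hypothetical stand-ins. In a real system the encoder would be a learned neural model and the augmented string would be passed to a trained generator.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
# A database entry is (code, summary). The summary is None for a
# unimodal (code-only) entry and present for a bimodal pair.
Entry = Tuple[str, Optional[str]]


def inner_product(a: Vector, b: Vector) -> float:
    """Similarity score between two dense embeddings."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_code(
    description: str,
    database: List[Entry],
    encode: Callable[[str], Vector],
    k: int = 1,
) -> List[Entry]:
    """Return the k entries whose code embedding best matches the query."""
    query_vec = encode(description)
    scored = [
        (inner_product(query_vec, encode(code)), (code, summary))
        for code, summary in database
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def augment(description: str, retrieved: List[Entry], sep: str = " <sep> ") -> str:
    """Concatenate the query with retrieved candidates (assumed input format)."""
    parts = [description]
    for code, summary in retrieved:
        parts.append(code)
        if summary is not None:  # bimodal entry: include the paired summary
            parts.append(summary)
    return sep.join(parts)


# Hypothetical encoder: a lookup table standing in for a learned model.
TOY_VECTORS = {
    "read a file": [1.0, 0.0],
    "def read(p): return open(p).read()": [0.9, 0.1],
    "def add(a, b): return a + b": [0.1, 0.9],
}


def toy_encode(text: str) -> Vector:
    return TOY_VECTORS[text]


database: List[Entry] = [
    ("def add(a, b): return a + b", "Add two numbers."),  # bimodal
    ("def read(p): return open(p).read()", None),         # unimodal
]

top = retrieve_code("read a file", database, toy_encode, k=1)
generator_input = augment("read a file", top)
print(generator_input)
```

Because the encoder is passed in as a function, the retriever can be swapped without touching the augmentation step, which mirrors the modularity the summary attributes to the framework.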

Empirical Evaluation

The authors conducted extensive experiments on benchmark datasets and report that REDCODER significantly improves the performance of existing state-of-the-art systems in both Java and Python. The framework raises Exact Match scores on code generation and BLEU-4 scores on code summarization, which indicates that the models make effective use of the retrieved input. Notably, even when the target candidates were removed from the retrieval database, REDCODER still showed substantial performance gains.
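
As a rough illustration of the evaluation described above, the following Python sketch computes an Exact Match score and filters the target out of a retrieval database. The function names and the whitespace-only normalization are assumptions made for this example; the paper's own evaluation scripts may tokenize or normalize differently.

```python
from typing import List


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions identical to their reference.

    Assumption: only leading and trailing whitespace is ignored.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def without_target(database: List[str], target: str) -> List[str]:
    """Drop entries equal to the target, so retrieval cannot return the answer."""
    return [entry for entry in database if entry.strip() != target.strip()]


predictions = ["return a + b", "return a - b"]
references = ["return a + b ", "return a * b"]
score = exact_match(predictions, references)

filtered = without_target(["return a + b", "return a * b"], "return a * b")
print(score, filtered)
```

Filtering the target before retrieval is one way to check that gains come from related examples rather than from retrieving the answer itself.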

Implications and Future Directions

The implications of REDCODER are promising: by automatically generating higher-quality code and summaries, it could increase programmers' productivity and reduce their workload. The framework offers a robust way to incorporate retrieval directly into code generation and summarization, and it may influence future AI-driven software engineering tools.

In terms of future advancements, the REDCODER approach could be extended to other code automation tasks such as code translation, presenting new avenues for research in the application of retrieval-augmented models. Further research could also focus on improving retrieval techniques or exploring different configurations of retrieval databases to enhance the efficacy of the framework.

Overall, the paper succeeds in presenting a comprehensive and well-evidenced argument towards the integration of retrieval-based augmentation in code generation and summarization, marking a significant step forward in developing intelligent programming assistants.

GitHub

GitHub - rizwan09/REDCODER (45 stars)