RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation (2303.12570v3)

Published 22 Mar 2023 in cs.CL, cs.AI, cs.PL, and cs.SE

Abstract: The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code LLM in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and has the ability to generate code at various levels of granularity. Moreover, we propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories covering line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder

RepoCoder: Enhancing Repository-Level Code Completion via Iterative Retrieval and Generation

The paper presents RepoCoder, an innovative framework designed for repository-level code completion. This task involves generating code continuations by understanding the context of the entire software repository rather than relying only on localized, in-file information. Such context is essential in practical software development, where files within a repository frequently interact and depend on one another to form a cohesive system.

Core Methodology

RepoCoder operates through a retrieval-generation pipeline built from two key components: a similarity-based code retriever and a pre-trained code LLM. The process is iterative, allowing RepoCoder to refine its completions by successively improving the retrieval and generation phases. Initially, the retriever searches for relevant code snippets scattered across the repository using the incomplete code as a query. These snippets augment the LLM's input, providing broader context that enhances its ability to complete the code. Beyond a single retrieval-generation step, RepoCoder iterates: the model-generated completion is used to refine the retrieval query, which in turn improves the snippets retrieved for the next generation round.
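
To make the loop concrete, the following is a minimal sketch of the iterative retrieval-generation idea, assuming a sparse token-overlap similarity for retrieval. The helper names (retrieve, generate, repocoder_complete) and parameters are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an iterative retrieval-generation loop (illustrative only).

def jaccard_similarity(query_tokens, candidate_tokens):
    """Sparse bag-of-words similarity between two token collections."""
    q, c = set(query_tokens), set(candidate_tokens)
    return len(q & c) / max(len(q | c), 1)

def retrieve(query_code, repo_snippets, top_k=5):
    """Rank repository code windows by token overlap with the query."""
    scored = [
        (jaccard_similarity(query_code.split(), snippet.split()), snippet)
        for snippet in repo_snippets
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]

def repocoder_complete(unfinished_code, repo_snippets, generate, iterations=2):
    """Iteratively refine retrieval using the model's own completion.

    `generate(prompt)` stands in for a call to a pre-trained code LLM.
    """
    query, completion = unfinished_code, ""
    for _ in range(iterations):
        context = retrieve(query, repo_snippets)
        prompt = "\n".join(context) + "\n" + unfinished_code
        completion = generate(prompt)
        # Append the fresh completion so the next retrieval query reflects
        # the code the model intends to write.
        query = unfinished_code + "\n" + completion
    return completion
```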

Evaluation and Results

RepoCoder's efficacy is evaluated on a newly proposed benchmark, RepoEval, which features high-quality, real-world repositories from GitHub. These repositories are specifically selected to cover a wide range of granular completion challenges, including line, API invocation, and function body completion. When compared against traditional In-File completion strategies and a retrieval-augmented generation (RAG) baseline, RepoCoder exhibits a notable performance boost, improving In-File completion metrics by over 10% across various settings. This demonstrates the framework's capability to leverage full repository contexts effectively.
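
As a rough illustration of how line- or API-level completions could be scored, the sketch below computes exact match and a simple edit-similarity ratio; the benchmark's actual scoring scripts (and the unit-test-based evaluation for function bodies) may differ in detail.

```python
# Illustrative scoring of predicted completions against references.
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> bool:
    """True when the completion matches the reference exactly, ignoring edge whitespace."""
    return prediction.strip() == reference.strip()

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1]; higher means closer to the reference."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

def evaluate(predictions, references):
    n = max(len(references), 1)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references))
    es = sum(edit_similarity(p, r) for p, r in zip(predictions, references))
    return {"exact_match": em / n, "edit_similarity": es / n}
```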

Implications and Future Directions

The introduction of RepoCoder signifies a meaningful advancement in the domain of code generation within software repositories. The iterative nature of its retrieval-generation process offers enhanced flexibility and adaptability over static rule-based methods or models that lack such iterative refinement. The practical implications of RepoCoder are substantial, potentially supporting developers by automating code completions that adhere to repository-specific coding styles and API structures.

The research opens up several avenues for future exploration. One direction is to explore more advanced retrieval models, potentially incorporating deep-learning approaches that better capture code semantics for improved snippet retrieval. Another is to evaluate RepoCoder on more diverse and complex repositories, such as those mixing languages and frameworks, to assess its scalability and adaptability.

In summary, the paper provides a comprehensive, well-executed study of enhancing code completion by utilizing repository-level information. RepoCoder's iterative framework marks a progressive step toward aligning code-generation outputs with a project's broader architectural context, promising improvements in both the efficiency and accuracy of software development workflows.

Authors (9)
  1. Fengji Zhang
  2. Bei Chen
  3. Yue Zhang
  4. Jacky Keung
  5. Jin Liu
  6. Daoguang Zan
  7. Yi Mao
  8. Jian-Guang Lou
  9. Weizhu Chen
Citations (155)