RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation (2303.12570v3)

Published 22 Mar 2023 in cs.CL, cs.AI, cs.PL, and cs.SE

Abstract: The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code LLM in an iterative retrieval-generation pipeline. RepoCoder makes effective use of repository-level information for code completion and has the ability to generate code at various levels of granularity. Moreover, we propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories covering line, API invocation, and function body completion scenarios. Experimental results indicate that RepoCoder significantly improves the In-File completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. Our source code and benchmark are publicly available: https://github.com/microsoft/CodeT/tree/main/RepoCoder

RepoCoder: Enhancing Repository-Level Code Completion via Iterative Retrieval and Generation

The paper presents RepoCoder, an innovative framework designed for repository-level code completion. This task involves generating code continuations by understanding the context of the entire software repository rather than relying only on localized, in-file information. Such context is essential in practical software development, where files within a repository frequently interact and depend on one another to form a cohesive system.

Core Methodology

RepoCoder operates through a retrieval-generation pipeline built from two key components: a similarity-based code retriever and a pre-trained code LLM. The process is iterative, allowing RepoCoder to refine its completions by successively improving the retrieval and generation phases. Initially, the retriever searches for relevant code snippets scattered across the repository using the incomplete code as a query. These snippets augment the LLM's input, providing broader context that enhances its ability to complete the code. Beyond a single retrieval-generation step, RepoCoder iterates: the model-generated completion is used to refine the retrieval query, which in turn improves the snippets retrieved for the next generation round.
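
To make the loop concrete, the following is a minimal sketch of the iterative retrieval-generation idea, assuming a sparse token-overlap similarity for retrieval. The helper names (retrieve, generate, repocoder_complete) and parameters are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an iterative retrieval-generation loop (illustrative only).

def jaccard_similarity(query_tokens, candidate_tokens):
    """Sparse bag-of-words similarity between two token collections."""
    q, c = set(query_tokens), set(candidate_tokens)
    return len(q & c) / max(len(q | c), 1)

def retrieve(query_code, repo_snippets, top_k=5):
    """Rank repository code windows by token overlap with the query."""
    scored = [
        (jaccard_similarity(query_code.split(), snippet.split()), snippet)
        for snippet in repo_snippets
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]

def repocoder_complete(unfinished_code, repo_snippets, generate, iterations=2):
    """Iteratively refine retrieval using the model's own completion.

    `generate(prompt)` stands in for a call to a pre-trained code LLM.
    """
    query, completion = unfinished_code, ""
    for _ in range(iterations):
        context = retrieve(query, repo_snippets)
        prompt = "\n".join(context) + "\n" + unfinished_code
        completion = generate(prompt)
        # Append the fresh completion so the next retrieval query reflects
        # the code the model intends to write.
        query = unfinished_code + "\n" + completion
    return completion
```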

Evaluation and Results

RepoCoder's efficacy is evaluated on a newly proposed benchmark, RepoEval, which features high-quality, real-world repositories from GitHub. These repositories are specifically selected to cover a wide range of granular completion challenges, including line, API invocation, and function body completion. When compared against traditional In-File completion strategies and a retrieval-augmented generation (RAG) baseline, RepoCoder exhibits a notable performance boost, improving In-File completion metrics by over 10% across various settings. This demonstrates the framework's capability to leverage full repository contexts effectively.
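
As a rough illustration of how line- or API-level completions could be scored, the sketch below computes exact match and a simple edit-similarity ratio; the benchmark's actual scoring scripts (and the unit-test-based evaluation for function bodies) may differ in detail.

```python
# Illustrative scoring of predicted completions against references.
from difflib import SequenceMatcher

def exact_match(prediction: str, reference: str) -> bool:
    """True when the completion matches the reference exactly, ignoring edge whitespace."""
    return prediction.strip() == reference.strip()

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1]; higher means closer to the reference."""
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()

def evaluate(predictions, references):
    n = max(len(references), 1)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references))
    es = sum(edit_similarity(p, r) for p, r in zip(predictions, references))
    return {"exact_match": em / n, "edit_similarity": es / n}
```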

Implications and Future Directions

The introduction of RepoCoder signifies a meaningful advancement in the domain of code generation within software repositories. The iterative nature of its retrieval-generation process offers enhanced flexibility and adaptability over static rule-based methods or models that lack such iterative refinement. The practical implications of RepoCoder are substantial, potentially supporting developers by automating code completions that adhere to repository-specific coding styles and API structures.

The research opens up several avenues for future exploration. One direction is to explore more advanced retrieval models, potentially incorporating deep-learning approaches that better capture code semantics for improved snippet retrieval. Another is to evaluate RepoCoder on more diverse and complex repositories, such as those mixing languages and frameworks, to assess its scalability and adaptability.

In summary, the paper provides a comprehensive, well-executed study of enhancing code completion by utilizing repository-level information. RepoCoder's iterative framework marks a progressive step toward aligning code-generation outputs with a project's broader architectural context, promising improvements in both the efficiency and accuracy of software development workflows.

Authors (9)
  1. Fengji Zhang
  2. Bei Chen
  3. Yue Zhang
  4. Jacky Keung
  5. Jin Liu
  6. Daoguang Zan
  7. Yi Mao
  8. Jian-Guang Lou
  9. Weizhu Chen
Citations (155)