REPOFUSE: Repository-Level Code Completion with Fused Dual Context (2402.14323v2)
Abstract: The success of LLMs in code assistance has spurred the proposal of repository-level code completion as a means to enhance prediction accuracy, utilizing the context from the entire codebase. However, this amplified context can inadvertently increase inference latency, potentially undermining the developer experience and deterring tool adoption, a challenge we term the Context-Latency Conundrum. This paper introduces REPOFUSE, a pioneering solution designed to enhance repository-level code completion without the latency trade-off. REPOFUSE uniquely fuses two types of context: the analogy context, rooted in code analogies, and the rationale context, which encompasses in-depth semantic relationships. We propose a novel rank truncated generation (RTG) technique that efficiently condenses these contexts into prompts of restricted size. This enables REPOFUSE to deliver precise code completions while maintaining inference efficiency. In testing with the CrossCodeEval suite, REPOFUSE demonstrated a significant leap over existing models, achieving a 40.90% to 59.75% increase in exact match (EM) accuracy for code completions and a 26.8% improvement in inference speed. Beyond experimental validation, REPOFUSE has been integrated into the workflow of a large enterprise, where it actively supports various coding tasks.
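The abstract describes rank truncated generation as ranking the retrieved context candidates and keeping only as much as fits in a fixed prompt budget. Below is a minimal, illustrative sketch of that idea: the snippet types, the `score` function (a placeholder token-level Jaccard similarity), the `rank_truncate` helper, and the token counting are all assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of rank-then-truncate prompt assembly for code completion.
# Assumes candidate context snippets (analogy and rationale) were already retrieved.
from dataclasses import dataclass
from typing import List


@dataclass
class ContextSnippet:
    text: str
    kind: str  # "analogy" or "rationale"


def score(snippet: ContextSnippet, query: str) -> float:
    """Placeholder relevance score: token-level Jaccard similarity with the query."""
    a, b = set(snippet.text.split()), set(query.split())
    return len(a & b) / max(len(a | b), 1)


def rank_truncate(snippets: List[ContextSnippet], query: str, budget_tokens: int) -> str:
    """Keep the highest-ranked snippets until the token budget is exhausted."""
    ranked = sorted(snippets, key=lambda s: score(s, query), reverse=True)
    kept, used = [], 0
    for s in ranked:
        n = len(s.text.split())  # crude token count; a real tokenizer would be used here
        if used + n > budget_tokens:
            continue
        kept.append(s.text)
        used += n
    return "\n\n".join(kept)


if __name__ == "__main__":
    query = "def load_config(path):"
    snippets = [
        ContextSnippet("def load_settings(path): ...", "analogy"),
        ContextSnippet("class Config: fields parsed from a YAML file", "rationale"),
    ]
    print(rank_truncate(snippets, query, budget_tokens=64))
```

The resulting string would be prepended to the in-file context before invoking the completion model, so the prompt stays within a fixed size regardless of repository scale.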
- Santacoder: don’t reach for the stars! CoRR, abs/2301.03988, 2023. doi: 10.48550/ARXIV.2301.03988. URL https://doi.org/10.48550/arXiv.2301.03988.
- Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pp. 207–216. IEEE Press, 2013. ISBN 9781467329361.
- Codeplan: Repository-level coding using LLMs and planning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=d0A2pc2kFp.
- Grounded copilot: How programmers interact with code-generating models. Proc. ACM Program. Lang., 7(OOPSLA1):85–111, 2023. doi: 10.1145/3586030. URL https://doi.org/10.1145/3586030.
- Efficient training of language models to fill in the middle. CoRR, abs/2207.14255, 2022. doi: 10.48550/ARXIV.2207.14255. URL https://doi.org/10.48550/arXiv.2207.14255.
- Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
- Evaluating large language models trained on code, 2021.
- Pangu-coder: Program synthesis with function-level language modeling. CoRR, abs/2207.11280, 2022. doi: 10.48550/ARXIV.2207.11280. URL https://doi.org/10.48550/arXiv.2207.11280.
- Dave Halter. Jedi: Awesome autocompletion, static analysis and refactoring library for Python. https://github.com/davidhalter/jedi.
- Codefuse-13b: A pretrained multi-lingual code large language model. CoRR, abs/2310.06266, 2023. doi: 10.48550/ARXIV.2310.06266. URL https://doi.org/10.48550/arXiv.2310.06266.
- Cocomic: Code completion by jointly modeling in-file and cross-file context. ArXiv, abs/2212.10007, 2022. URL https://api.semanticscholar.org/CorpusID:254877371.
- Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion, 2023.
- CodeBERT: A pre-trained model for programming and natural languages. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.139. URL https://aclanthology.org/2020.findings-emnlp.139.
- Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=hQwb-lbM6EL.
- UniXcoder: Unified cross-modal pre-training for code representation. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7212–7225, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.499. URL https://aclanthology.org/2022.acl-long.499.
- Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
- Grace: Language models meet code edits. To appear at ESEC/FSE ’23, September 2023. URL https://www.microsoft.com/en-us/research/publication/grace/.
- Big code != big vocabulary: open-vocabulary models for source code. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, pp. 1073–1085, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371216. doi: 10.1145/3377811.3380342. URL https://doi.org/10.1145/3377811.3380342.
- Starcoder: may the source be with you!, 2023.
- Repobench: Benchmarking repository-level code auto-completion systems, 2023.
- ReACC: A retrieval-augmented code completion framework. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6227–6240, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.431. URL https://aclanthology.org/2022.acl-long.431.
- Codegen2: Lessons for training LLMs on programming and natural languages. ICLR, 2023.
- Better context makes better code language models: A case study on function call argument completion. In Williams, B., Chen, Y., and Neville, J. (eds.), Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pp. 5230–5238. AAAI Press, 2023. doi: 10.1609/AAAI.V37I4.25653. URL https://doi.org/10.1609/aaai.v37i4.25653.
- Code llama: Open foundation models for code, 2023.
- Repofusion: Training code models to understand your repository, 2023a.
- Repository-level prompt generation for large language models of code. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023b.
- How practitioners expect code completion? In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pp. 1294–1306, New York, NY, USA, 2023a. Association for Computing Machinery. ISBN 9798400703270. doi: 10.1145/3611643.3616280. URL https://doi.org/10.1145/3611643.3616280.
- Practitioners’ expectations on code completion. CoRR, abs/2301.03846, 2023b. doi: 10.48550/ARXIV.2301.03846. URL https://doi.org/10.48550/arXiv.2301.03846.
- Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy, pp. 590–604, 2014. doi: 10.1109/SP.2014.44.
- A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges, 2023.
- Private-library-oriented code generation with large language models, 2023.
- RepoCoder: Repository-level code completion through iterative retrieval and generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2471–2484, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
- Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. CoRR, abs/2303.17568, 2023. doi: 10.48550/ARXIV.2303.17568. URL https://doi.org/10.48550/arXiv.2303.17568.