RepoFusion: Training Code Models to Understand Your Repository (2306.10998v1)
Abstract: Despite the huge success of LLMs in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi ($\sim 73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}.
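The core idea is to pair each retrieved repository context (imports, sibling files, files with similar names) with the code surrounding the completion point, encode each pair independently, and let the decoder attend over all of them jointly, in the spirit of Fusion-in-Decoder. Below is a minimal inference sketch of that idea, assuming a T5-style encoder-decoder such as CodeT5-base; the model name, prompt layout, and example contexts are illustrative assumptions, not the exact RepoFusion recipe or released checkpoints.

```python
# Fusion-in-Decoder-style sketch for repository-context code completion.
# Assumptions: a T5-style encoder-decoder (Salesforce/codet5-base) and a toy
# prompt layout; not the exact RepoFusion configuration.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Code around the hole to be completed, plus retrieved repository contexts.
surrounding_code = "public int add(int a, int b) {\n    return "
repo_contexts = [
    "// imports of the current file\nimport java.util.List;",
    "// a file with a similar name\nclass Calculator { int sub(int a, int b) { return a - b; } }",
]

# Each (repository context, surrounding code) pair is encoded independently.
inputs = tokenizer(
    [ctx + "\n" + surrounding_code for ctx in repo_contexts],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
encoded = model.encoder(
    input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
).last_hidden_state                                    # (N, L, d)

# Fusion-in-Decoder: concatenate the encoded contexts along the sequence axis
# so the decoder cross-attends over all contexts at once.
fused = encoded.reshape(1, -1, encoded.size(-1))       # (1, N*L, d)
fused_mask = inputs.attention_mask.reshape(1, -1)      # (1, N*L)

completion_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=fused_mask,
    max_new_tokens=16,
)
print(tokenizer.decode(completion_ids[0], skip_special_tokens=True))
```

Training follows the same shape: the fused encoder states are fed to the decoder and the model is optimized to generate the held-out target line, which is what lets a small encoder-decoder exploit many contexts without exceeding a single context window.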
- Disha Shrivastava
- Denis Kocetkov
- Harm de Vries
- Dzmitry Bahdanau
- Torsten Scholak