
FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion (2404.01554v1)

Published 2 Apr 2024 in cs.SE

Abstract: The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and tools like GitHub Copilot. However, the substantial size of these models, especially large language models, poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning. Despite their potential, a significant challenge is that the designs of these methods often rely on heuristics, leaving open critical questions about what information should be stored or retrieved and how that information should be interpolated to augment predictions. To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of delta logits as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. While FT2Ra adopts a retrieval-based mechanism, it uniquely employs a paradigm with a learning rate and multi-epoch retrievals, similar to fine-tuning. In token-level completion, a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy over the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a more than twofold increase in Exact Match (EM) performance, indicating the significant advantages of our theoretical analysis. Notably, even without actual fine-tuning, FT2Ra exhibits performance competitive with models that have undergone genuine fine-tuning.
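The abstract describes the mechanism only at a high level: the effect of fine-tuning on a prediction is summarized as "delta logits", which FT2Ra retrieves from a datastore and applies with a learning rate over several retrieval "epochs". The snippet below is a minimal Python sketch of that idea, not the paper's implementation; the `DeltaLogitStore` class, its Euclidean nearest-neighbor search, and the uniform averaging of neighbor deltas are assumptions made purely for illustration, and in the real method the retrieved neighbors may change as the context is updated across epochs.

```python
import numpy as np

class DeltaLogitStore:
    """Toy datastore: context vectors paired with 'delta logits', i.e. the
    change that fine-tuning would have made to the model's output logits."""
    def __init__(self, keys, delta_logits):
        self.keys = np.asarray(keys)                  # (N, d) context vectors
        self.delta_logits = np.asarray(delta_logits)  # (N, V) per-entry deltas

    def search(self, query, k):
        # Euclidean nearest neighbors (an assumption; any retriever could be used).
        dist = np.linalg.norm(self.keys - query, axis=1)
        idx = np.argsort(dist)[:k]
        return self.delta_logits[idx]

def ft2ra_predict(logits, store, query, k=4, lr=0.3, epochs=3):
    """Multi-epoch retrieval: each 'epoch' nudges the logits by the mean of the
    retrieved delta logits, scaled by a learning rate, mimicking a fine-tuning step."""
    for _ in range(epochs):
        logits = logits + lr * store.search(query, k).mean(axis=0)
    return int(np.argmax(logits))

# Toy usage: 2-dimensional contexts, 5-token vocabulary, random datastore entries.
store = DeltaLogitStore(keys=np.random.randn(100, 2),
                        delta_logits=np.random.randn(100, 5))
print(ft2ra_predict(logits=np.zeros(5), store=store, query=np.zeros(2)))
```

In the toy usage the keys and deltas are random; in practice they would be built from a retrieval corpus of code contexts and the corresponding prediction corrections.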
