- The paper demonstrates that similarity-based RAG, especially the hybrid BM25+GTE-Qwen approach, significantly improves code completion performance over base LLMs.
- The methodology employs identifier- and similarity-based retrieval across 26 open-source LLMs, addressing challenges in proprietary C++ code extraction and distributional shifts.
- Key implications include enhanced industrial code completion without retraining, validated by developer surveys and offering actionable insights for closed-source environments.
Retrieval-Augmented Generation for Code Completion in Closed-Source Industrial Environments
Introduction and Motivation
This work presents a comprehensive empirical study of retrieval-augmented generation (RAG) for code completion in closed-source, industrial-scale codebases, specifically within the WeChat ecosystem. The study addresses the distributional shift and unique challenges posed by proprietary codebases, which often diverge from open-source repositories in code patterns, frameworks, and domain-specific constructs. The authors systematically evaluate two principal RAG paradigms—identifier-based and similarity-based—across 26 open-source LLMs (0.5B–671B parameters), leveraging a curated benchmark and a large-scale internal code corpus.
Figure 1: Distribution of benchmark examples across domains and difficulty levels, reflecting real-world code completion scenarios in WeChat.
Benchmark and Retrieval Corpus Construction
A function-level benchmark was constructed via manual annotation by experienced developers, ensuring the inclusion of contextually rich, production-grade code examples spanning seven enterprise-relevant domains. The retrieval corpus comprises 1,669 internal C++ repositories, processed using a fine-grained extraction algorithm that addresses C++-specific challenges: header file segmentation, recursive dependencies, auto-generated code (e.g., protobuf), and macro handling. The extraction pipeline utilizes tree-sitter for AST-based parsing and regular expressions for protobuf, yielding a corpus of function/class definitions, declarations, and message types suitable for both identifier and similarity-based retrieval.
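The AST-based step of such a pipeline can be illustrated with tree-sitter's Python bindings. The sketch below is a minimal example assuming the tree_sitter_languages package for a prebuilt C++ grammar; it only extracts plain function definitions, whereas the paper's internal pipeline additionally handles header segmentation, recursive dependencies, protobuf-generated code, and macros.

```python
# Minimal sketch of AST-based function extraction, assuming the
# tree_sitter_languages package (prebuilt C++ grammar). The paper's real
# pipeline also handles headers, recursive dependencies, protobuf, and macros.
from tree_sitter_languages import get_parser

parser = get_parser("cpp")

def extract_function_definitions(source: bytes) -> list[str]:
    """Return the source text of every function definition in a C++ file."""
    tree = parser.parse(source)
    definitions = []

    def walk(node):
        if node.type == "function_definition":
            definitions.append(
                source[node.start_byte:node.end_byte].decode("utf-8", "replace"))
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return definitions

if __name__ == "__main__":
    print(extract_function_definitions(b"int add(int a, int b) { return a + b; }"))
```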
RAG Methodologies
Identifier-Based RAG
This approach retrieves background knowledge (e.g., function/class/message definitions) relevant to identifiers present in the code context; a minimal sketch follows the list below. The process involves:
- Index Creation: Building type-specific indices for fast lookup.
- Identifier Extraction: Using a strong LLM to extract required identifiers from the code context.
- Prompt Construction: Assembling prompts that inject retrieved background knowledge into the LLM input.
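As a concrete illustration, the sketch below shows how type-specific indices, identifier extraction, and prompt assembly could fit together. The dictionary indices, the regex-based stand-in for the paper's LLM-driven identifier extraction, and the prompt template are all illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of identifier-based RAG. The dict-based indices and the
# prompt template are illustrative assumptions; the paper uses a strong LLM
# for identifier extraction, stood in for here by a naive regex.
import re

def build_indices(functions: dict[str, str], classes: dict[str, str],
                  messages: dict[str, str]) -> dict[str, dict[str, str]]:
    """Type-specific indices mapping identifier name -> definition text."""
    return {"function": functions, "class": classes, "message": messages}

def extract_identifiers(code_context: str) -> list[str]:
    """Stand-in for the paper's LLM-based extraction step."""
    return sorted(set(re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code_context)))

def retrieve_background(identifiers: list[str],
                        indices: dict[str, dict[str, str]]) -> list[str]:
    """Look each identifier up in every type-specific index and collect hits."""
    return [index[name] for name in identifiers
            for index in indices.values() if name in index]

def build_prompt(code_context: str,
                 indices: dict[str, dict[str, str]]) -> str:
    """Inject retrieved definitions above the unfinished code context."""
    identifiers = extract_identifiers(code_context)
    background = "\n\n".join(retrieve_background(identifiers, indices))
    return ("// Relevant definitions from the repository:\n"
            f"{background}\n\n"
            "// Complete the following code:\n"
            f"{code_context}")
```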
Similarity-Based RAG
This paradigm retrieves code snippets similar to the current context using either lexical or semantic similarity (see the sketch after this list):
- Lexical Retrieval: BM25 is employed for term-based matching, combining TF-IDF-style term weighting with document-length normalization.
- Semantic Retrieval: Embedding-based retrieval using models such as CodeBERT, UniXcoder, CoCoSoDa, and GTE-Qwen, with cosine similarity in the embedding space.
- Prompt Construction: Retrieved similar code is concatenated with the current context in the prompt.
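A minimal sketch of the semantic branch and the prompt assembly is given below. It assumes the sentence-transformers package, and the GTE-Qwen checkpoint name is a placeholder rather than the paper's exact configuration; the prompt template is likewise illustrative.

```python
# Hedged sketch of embedding-based similarity retrieval and prompt building.
# The checkpoint name is an assumed placeholder for GTE-Qwen.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")  # assumed checkpoint

def retrieve_similar(context: str, corpus: list[str], k: int = 4) -> list[str]:
    """Rank corpus snippets by cosine similarity to the unfinished context."""
    corpus_vecs = encoder.encode(corpus, normalize_embeddings=True)
    query_vec = encoder.encode([context], normalize_embeddings=True)[0]
    top = np.argsort(corpus_vecs @ query_vec)[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(context: str, corpus: list[str]) -> str:
    """Concatenate retrieved similar code above the current context."""
    similar = "\n\n".join(retrieve_similar(context, corpus))
    return f"// Similar code from the repository:\n{similar}\n\n{context}"
```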
Experimental Setup
The evaluation spans 26 open-source LLMs, including both code-specialized and general-purpose models, with parameter counts ranging from 0.5B to 671B. The primary metrics are CodeBLEU (CB) and Edit Similarity (ES), capturing both syntactic and semantic fidelity. All models are deployed with consistent inference settings (temperature=0), and retrieval is limited to four candidates to fit within a 2k token context window.
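Edit Similarity is commonly defined as one minus the normalized Levenshtein distance between the completion and the reference. The sketch below follows that common definition; the paper's exact implementation (e.g., tokenization or whitespace normalization) may differ.

```python
# Hedged sketch of the Edit Similarity (ES) metric as commonly defined:
# 1 - Levenshtein distance / max(len(prediction), len(reference)).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, reference: str) -> float:
    if not prediction and not reference:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / max(len(prediction), len(reference))

print(edit_similarity("return a + b;", "return a+b;"))  # ~0.85
```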
Empirical Findings
Effectiveness of RAG
Both identifier-based and similarity-based RAG methods consistently improve code completion performance over base LLMs across all model sizes. Notably, similarity-based RAG yields substantially higher gains. For example, Qwen2.5-Coder-14B-Instruct improves from 29.79/48.56 (CB/ES) to 51.12/61.96 with GTE-Qwen-based RAG—a 71.6% and 27.6% relative increase, respectively. DeepSeek-V3 achieves a 71.1% and 33.3% improvement in CB/ES with GTE-Qwen.
Identifier-based RAG is most effective when retrieving function definitions, but is consistently outperformed by similarity-based RAG, especially as model scale increases.
Retrieval Technique Analysis
Among semantic retrieval models, CodeBERT underperforms relative to UniXcoder, CoCoSoDa, and GTE-Qwen, likely because its MLM/RTD pretraining objectives are less suited to retrieval than the contrastive objectives used by the stronger models. BM25, despite its simplicity, demonstrates robust performance across all model scales and query formulations.
A key observation is that most semantic retrieval models perform better when the retrieval query is a complete code snippet, whereas GTE-Qwen excels with incomplete code contexts, aligning well with the requirements of code completion tasks.
Complementarity of Lexical and Semantic Retrieval
There is minimal overlap between the candidates retrieved by BM25 and semantic models, indicating that they capture orthogonal aspects of code similarity. Combining BM25 with GTE-Qwen yields the best performance for most LLMs, especially at larger scales (7B+). For instance, DeepSeek-V3 achieves 63.62/75.26 (CB/ES) with the BM25+GTE-Qwen combination, surpassing the performance of either technique alone. However, this complementarity is less pronounced in smaller models (<7B).
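The exact fusion rule is not reproduced here; the sketch below makes the simple assumption of interleaving the BM25 and embedding-based rankings and deduplicating, which is one straightforward way to realize the reported complementarity within a fixed candidate budget.

```python
# Hedged sketch of BM25 + embedding fusion. The interleave-and-deduplicate
# rule is an assumption; the paper reports that combining the two retrievers
# outperforms either alone, not the specific merging strategy used.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_retrieve(context: str, corpus: list[str], corpus_vecs: np.ndarray,
                    encode, k: int = 4) -> list[str]:
    """Interleave BM25 and cosine-similarity rankings, dropping duplicates.

    `encode` is any function mapping a string to a normalized embedding
    (e.g., a GTE-Qwen encoder); `corpus_vecs` holds the corpus embeddings.
    """
    bm25 = BM25Okapi([snippet.split() for snippet in corpus])
    lexical = np.argsort(bm25.get_scores(context.split()))[::-1]
    semantic = np.argsort(corpus_vecs @ encode(context))[::-1]

    merged: list[int] = []
    for lex_idx, sem_idx in zip(lexical, semantic):
        for idx in (int(lex_idx), int(sem_idx)):
            if idx not in merged:
                merged.append(idx)
    return [corpus[i] for i in merged[:k]]
```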
Developer-Centric Evaluation
A developer survey was conducted to assess the practical utility of RAG-augmented completions. The combination of BM25 and GTE-Qwen consistently received higher subjective quality scores than either technique alone across multiple LLMs. Error analysis revealed that logical errors dominate, suggesting that further improvements in LLM reasoning are necessary.
Figure 2: Developer survey results, highlighting the superiority of combined lexical and semantic retrieval for code completion quality.
Implications and Future Directions
- Industrial Applicability: RAG enables open-source LLMs to effectively leverage proprietary codebases without retraining, addressing privacy and data scarcity concerns in industrial environments.
- Retrieval Model Alignment: There is a misalignment between the training of semantic retrieval models (on complete code) and their deployment (on incomplete queries). Future work should focus on retrieval models optimized for partial code contexts or leverage architectures like GTE-Qwen that are robust to such scenarios.
- Hybrid Retrieval Strategies: The demonstrated complementarity between lexical and semantic retrieval suggests that hybrid strategies—beyond simple retrieve-then-rerank pipelines—can yield further gains, particularly in large-scale models.
Limitations
- Codebase Specificity: The study is conducted on WeChat's codebase, which may not generalize to all proprietary environments. However, the diversity of the corpus mitigates this concern.
- Metric Limitations: Automated metrics may not fully capture semantic correctness; human evaluation is necessary for comprehensive assessment.
Conclusion
This paper provides a rigorous, large-scale evaluation of RAG for code completion in closed-source settings. Both identifier-based and similarity-based RAG methods are effective, with similarity-based RAG—especially hybrid BM25+GTE-Qwen retrieval—yielding the highest gains. The findings offer actionable guidance for deploying RAG-augmented code completion in industrial environments and highlight avenues for future research in retrieval model design and hybrid retrieval strategies.