- The paper demonstrates that similarity-based RAG, especially the hybrid BM25+GTE-Qwen approach, significantly improves code completion performance over base LLMs.
- The methodology employs identifier- and similarity-based retrieval across 26 open-source LLMs, addressing challenges in proprietary C++ code extraction and distributional shifts.
- Key implications include enhanced industrial code completion without retraining, validated by developer surveys and offering actionable insights for closed-source environments.
Retrieval-Augmented Generation for Code Completion in Closed-Source Industrial Environments
Introduction and Motivation
This work presents a comprehensive empirical study of retrieval-augmented generation (RAG) for code completion in closed-source, industrial-scale codebases, specifically within the WeChat ecosystem. The study addresses the distributional shift and unique challenges posed by proprietary codebases, which often diverge from open-source repositories in code patterns, frameworks, and domain-specific constructs. The authors systematically evaluate two principal RAG paradigms—identifier-based and similarity-based—across 26 open-source LLMs (0.5B–671B parameters), leveraging a curated benchmark and a large-scale internal code corpus.
Figure 1: Distribution of benchmark examples across domains and difficulty levels, reflecting real-world code completion scenarios in WeChat.
Benchmark and Retrieval Corpus Construction
A function-level benchmark was constructed via manual annotation by experienced developers, ensuring the inclusion of contextually rich, production-grade code examples spanning seven enterprise-relevant domains. The retrieval corpus comprises 1,669 internal C++ repositories, processed using a fine-grained extraction algorithm that addresses C++-specific challenges: header file segmentation, recursive dependencies, auto-generated code (e.g., protobuf), and macro handling. The extraction pipeline utilizes tree-sitter for AST-based parsing and regular expressions for protobuf, yielding a corpus of function/class definitions, declarations, and message types suitable for both identifier and similarity-based retrieval.
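The AST-based step of such a pipeline can be illustrated with tree-sitter's Python bindings. The sketch below is a minimal example assuming the tree_sitter_languages package for a prebuilt C++ grammar; it only extracts plain function definitions, whereas the paper's internal pipeline additionally handles header segmentation, recursive dependencies, protobuf-generated code, and macros.

```python
# Minimal sketch of AST-based function extraction, assuming the
# tree_sitter_languages package (prebuilt C++ grammar). The paper's real
# pipeline also handles headers, recursive dependencies, protobuf, and macros.
from tree_sitter_languages import get_parser

parser = get_parser("cpp")

def extract_function_definitions(source: bytes) -> list[str]:
    """Return the source text of every function definition in a C++ file."""
    tree = parser.parse(source)
    definitions = []

    def walk(node):
        if node.type == "function_definition":
            definitions.append(
                source[node.start_byte:node.end_byte].decode("utf-8", "replace"))
        for child in node.children:
            walk(child)

    walk(tree.root_node)
    return definitions

if __name__ == "__main__":
    print(extract_function_definitions(b"int add(int a, int b) { return a + b; }"))
```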
RAG Methodologies
Identifier-Based RAG
This approach retrieves background knowledge (e.g., function/class/message definitions) relevant to identifiers present in the code context; a minimal sketch follows the list below. The process involves:
- Index Creation: Building type-specific indices for fast lookup.
- Identifier Extraction: Using a strong LLM to extract required identifiers from the code context.
- Prompt Construction: Assembling prompts that inject retrieved background knowledge into the LLM input.
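As a concrete illustration, the sketch below shows how type-specific indices, identifier extraction, and prompt assembly could fit together. The dictionary indices, the regex-based stand-in for the paper's LLM-driven identifier extraction, and the prompt template are all illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of identifier-based RAG. The dict-based indices and the
# prompt template are illustrative assumptions; the paper uses a strong LLM
# for identifier extraction, stood in for here by a naive regex.
import re

def build_indices(functions: dict[str, str], classes: dict[str, str],
                  messages: dict[str, str]) -> dict[str, dict[str, str]]:
    """Type-specific indices mapping identifier name -> definition text."""
    return {"function": functions, "class": classes, "message": messages}

def extract_identifiers(code_context: str) -> list[str]:
    """Stand-in for the paper's LLM-based extraction step."""
    return sorted(set(re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code_context)))

def retrieve_background(identifiers: list[str],
                        indices: dict[str, dict[str, str]]) -> list[str]:
    """Look each identifier up in every type-specific index and collect hits."""
    return [index[name] for name in identifiers
            for index in indices.values() if name in index]

def build_prompt(code_context: str,
                 indices: dict[str, dict[str, str]]) -> str:
    """Inject retrieved definitions above the unfinished code context."""
    identifiers = extract_identifiers(code_context)
    background = "\n\n".join(retrieve_background(identifiers, indices))
    return ("// Relevant definitions from the repository:\n"
            f"{background}\n\n"
            "// Complete the following code:\n"
            f"{code_context}")
```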
Similarity-Based RAG
This paradigm retrieves code snippets similar to the current context using either lexical or semantic similarity (see the sketch after this list):
- Lexical Retrieval: BM25 is employed for term-based matching, combining TF-IDF-style term weighting with document-length normalization.
- Semantic Retrieval: Embedding-based retrieval using models such as CodeBERT, UniXcoder, CoCoSoDa, and GTE-Qwen, with cosine similarity in the embedding space.
- Prompt Construction: Retrieved similar code is concatenated with the current context in the prompt.
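A minimal sketch of the semantic branch and the prompt assembly is given below. It assumes the sentence-transformers package, and the GTE-Qwen checkpoint name is a placeholder rather than the paper's exact configuration; the prompt template is likewise illustrative.

```python
# Hedged sketch of embedding-based similarity retrieval and prompt building.
# The checkpoint name is an assumed placeholder for GTE-Qwen.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")  # assumed checkpoint

def retrieve_similar(context: str, corpus: list[str], k: int = 4) -> list[str]:
    """Rank corpus snippets by cosine similarity to the unfinished context."""
    corpus_vecs = encoder.encode(corpus, normalize_embeddings=True)
    query_vec = encoder.encode([context], normalize_embeddings=True)[0]
    top = np.argsort(corpus_vecs @ query_vec)[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(context: str, corpus: list[str]) -> str:
    """Concatenate retrieved similar code above the current context."""
    similar = "\n\n".join(retrieve_similar(context, corpus))
    return f"// Similar code from the repository:\n{similar}\n\n{context}"
```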
Experimental Setup
The evaluation spans 26 open-source LLMs, including both code-specialized and general-purpose models, with parameter counts ranging from 0.5B to 671B. The primary metrics are CodeBLEU (CB) and Edit Similarity (ES), capturing both syntactic and semantic fidelity. All models are deployed with consistent inference settings (temperature=0), and retrieval is limited to four candidates to fit within a 2k token context window.
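Edit Similarity is commonly defined as one minus the normalized Levenshtein distance between the completion and the reference. The sketch below follows that common definition; the paper's exact implementation (e.g., tokenization or whitespace normalization) may differ.

```python
# Hedged sketch of the Edit Similarity (ES) metric as commonly defined:
# 1 - Levenshtein distance / max(len(prediction), len(reference)).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, reference: str) -> float:
    if not prediction and not reference:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / max(len(prediction), len(reference))

print(edit_similarity("return a + b;", "return a+b;"))  # ~0.85
```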
Empirical Findings
Effectiveness of RAG
Both identifier-based and similarity-based RAG methods consistently improve code completion performance over base LLMs across all model sizes. Notably, similarity-based RAG yields substantially higher gains. For example, Qwen2.5-Coder-14B-Instruct improves from 29.79/48.56 (CB/ES) to 51.12/61.96 with GTE-Qwen-based RAG—a 71.6% and 27.6% relative increase, respectively. DeepSeek-V3 achieves a 71.1% and 33.3% improvement in CB/ES with GTE-Qwen.
Identifier-based RAG is most effective when retrieving function definitions, but is consistently outperformed by similarity-based RAG, especially as model scale increases.
Retrieval Technique Analysis
Among semantic retrieval models, CodeBERT underperforms relative to UniXcoder, CoCoSoDa, and GTE-Qwen, likely because its MLM/RTD pretraining objectives are less suited to retrieval than the contrastive objectives used by the stronger models. BM25, despite its simplicity, demonstrates robust performance across all model scales and query formulations.
A key observation is that most semantic retrieval models perform better when the retrieval query is a complete code snippet, whereas GTE-Qwen excels with incomplete code contexts, aligning well with the requirements of code completion tasks.
Complementarity of Lexical and Semantic Retrieval
There is minimal overlap between the candidates retrieved by BM25 and semantic models, indicating that they capture orthogonal aspects of code similarity. Combining BM25 with GTE-Qwen yields the best performance for most LLMs, especially at larger scales (7B+). For instance, DeepSeek-V3 achieves 63.62/75.26 (CB/ES) with the BM25+GTE-Qwen combination, surpassing the performance of either technique alone. However, this complementarity is less pronounced in smaller models (<7B).
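The exact fusion rule is not reproduced here; the sketch below makes the simple assumption of interleaving the BM25 and embedding-based rankings and deduplicating, which is one straightforward way to realize the reported complementarity within a fixed candidate budget.

```python
# Hedged sketch of BM25 + embedding fusion. The interleave-and-deduplicate
# rule is an assumption; the paper reports that combining the two retrievers
# outperforms either alone, not the specific merging strategy used.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_retrieve(context: str, corpus: list[str], corpus_vecs: np.ndarray,
                    encode, k: int = 4) -> list[str]:
    """Interleave BM25 and cosine-similarity rankings, dropping duplicates.

    `encode` is any function mapping a string to a normalized embedding
    (e.g., a GTE-Qwen encoder); `corpus_vecs` holds the corpus embeddings.
    """
    bm25 = BM25Okapi([snippet.split() for snippet in corpus])
    lexical = np.argsort(bm25.get_scores(context.split()))[::-1]
    semantic = np.argsort(corpus_vecs @ encode(context))[::-1]

    merged: list[int] = []
    for lex_idx, sem_idx in zip(lexical, semantic):
        for idx in (int(lex_idx), int(sem_idx)):
            if idx not in merged:
                merged.append(idx)
    return [corpus[i] for i in merged[:k]]
```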
Developer-Centric Evaluation
A developer survey was conducted to assess the practical utility of RAG-augmented completions. The combination of BM25 and GTE-Qwen consistently received higher subjective quality scores than either technique alone across multiple LLMs. Error analysis revealed that logical errors dominate, suggesting that further improvements in LLM reasoning are necessary.
Figure 2: Developer survey results, highlighting the superiority of combined lexical and semantic retrieval for code completion quality.
Implications and Future Directions
- Industrial Applicability: RAG enables open-source LLMs to effectively leverage proprietary codebases without retraining, addressing privacy and data scarcity concerns in industrial environments.
- Retrieval Model Alignment: There is a misalignment between the training of semantic retrieval models (on complete code) and their deployment (on incomplete queries). Future work should focus on retrieval models optimized for partial code contexts or leverage architectures like GTE-Qwen that are robust to such scenarios.
- Hybrid Retrieval Strategies: The demonstrated complementarity between lexical and semantic retrieval suggests that hybrid strategies—beyond simple retrieve-then-rerank pipelines—can yield further gains, particularly in large-scale models.
Limitations
- Codebase Specificity: The study is conducted on WeChat's codebase, which may not generalize to all proprietary environments. However, the diversity of the corpus mitigates this concern.
- Metric Limitations: Automated metrics may not fully capture semantic correctness; human evaluation is necessary for comprehensive assessment.
Conclusion
This paper provides a rigorous, large-scale evaluation of RAG for code completion in closed-source settings. Both identifier-based and similarity-based RAG methods are effective, with similarity-based RAG—especially hybrid BM25+GTE-Qwen retrieval—yielding the highest gains. The findings offer actionable guidance for deploying RAG-augmented code completion in industrial environments and highlight avenues for future research in retrieval model design and hybrid retrieval strategies.