Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs (2406.18294v2)

Published 26 Jun 2024 in cs.CL

Abstract: Some recently developed code LLMs (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code LLMs, leading to significant performance degradation. In this study, we conducted extensive preliminary experiments and analyses on six Repo-Code LLMs. The results indicate that maintaining the topological dependencies of files and increasing the code file content in the completion prompts can improve completion accuracy; pruning the specific implementations of functions in all dependent files does not significantly reduce the accuracy of completions. Based on these findings, we proposed a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content. The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content, significantly reducing the input length for repository-level code completion. We applied the HCP strategy in experiments with six Repo-Code LLMs, and the results demonstrate that our proposed method can significantly enhance completion accuracy while substantially reducing the length of input. Our code and data are available at https://github.com/Hambaobao/HCP-Coder.

Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

The paper "Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs" presents a novel approach to enhancing the performance of code completion tools within real-world software development environments. This research is centered on managing the extensive context requirements of code completion models pretrained on repository-level datasets, referred to as Repo-Code LLMs.

Key Insights and Methodology

The paper acknowledges the constraint posed by the limited context window of Repo-Code LLMs, which can result in performance degradation when handling large repositories. To tackle this challenge, the authors conducted an extensive series of experiments across six diverse Repo-Code LLMs, investigating both dependency management and the pruning of context within repositories.

Hierarchical Context Pruning Strategy

Their core contribution is the Hierarchical Context Pruning (HCP) strategy, a method that models a codebase at the function level. Rather than attempting to feed entire repositories into the model, HCP retains high-value content, preserves the topological dependencies between code files, and eliminates redundant or irrelevant content. This strategy effectively reduces input size while preserving or improving code completion accuracy.
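
To make the function-level modeling concrete, the following is a minimal sketch (not the authors' implementation) of how a Python source file can be reduced to a signature-level skeleton: each function keeps its signature and docstring, while its implementation is replaced with a placeholder.

```python
# Minimal sketch of function-level pruning: keep signatures and docstrings,
# drop implementations. Illustrative only, not the authors' HCP code.
import ast


def prune_function_bodies(source: str) -> str:
    """Return a skeleton of `source` with function bodies stripped."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            docstring = ast.get_docstring(node)
            if docstring is not None:
                # Keep the docstring so the model still sees the function's intent.
                new_body.append(ast.Expr(value=ast.Constant(value=docstring)))
            # Replace the implementation with an Ellipsis placeholder.
            new_body.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = new_body
    return ast.unparse(tree)


if __name__ == "__main__":
    demo = '''
def add(a: int, b: int) -> int:
    """Add two numbers."""
    result = a + b
    return result
'''
    # Prints the skeleton: the signature and docstring, with the body replaced by `...`.
    print(prune_function_bodies(demo))
```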

  • Dependency Management: The analysis revealed that retaining topological dependencies between code files improves completion accuracy, with dependencies at a depth of one being the most influential. Beyond the first level, the impact on accuracy diminishes, suggesting a practical limit to beneficial dependency depth.
  • Content Pruning: By examining the composition of code files, the authors found that replacing function implementations with their headers (signatures) barely affects results. These findings were integrated into HCP to prune code content that does not meaningfully contribute to the accuracy of model predictions. A sketch combining both observations follows this list.
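
As a brief, hypothetical illustration of how the two observations can be combined: files within one import hop of the file being completed are kept verbatim, while more distant files are reduced to signature-level skeletons (reusing a `prune_function_bodies` helper such as the one sketched above). The dependency graph is assumed to be precomputed; this is a sketch of the idea, not the authors' HCP implementation.

```python
# Hypothetical context builder: full text for depth-1 dependencies,
# skeletons for everything farther away.
from collections import deque
from typing import Callable, Dict, List


def build_context(
    target_file: str,
    sources: Dict[str, str],            # path -> file content
    dep_graph: Dict[str, List[str]],    # path -> list of imported files (assumed precomputed)
    prune_function_bodies: Callable[[str], str],
    max_depth: int = 1,
) -> str:
    """Concatenate repository context, keeping near dependencies verbatim."""
    # Breadth-first walk over the import graph to label each file with its depth.
    depth = {target_file: 0}
    queue = deque([target_file])
    while queue:
        current = queue.popleft()
        for dep in dep_graph.get(current, []):
            if dep not in depth:
                depth[dep] = depth[current] + 1
                queue.append(dep)

    parts = []
    for path, content in sources.items():
        if path == target_file:
            continue  # the incomplete file itself is appended last by the caller
        d = depth.get(path)
        if d is not None and d <= max_depth:
            parts.append(f"# file: {path}\n{content}")                          # keep in full
        else:
            parts.append(f"# file: {path}\n{prune_function_bodies(content)}")   # skeleton only
    return "\n\n".join(parts)
```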

In their experimental setup, which covered repositories of varying size and structure, HCP demonstrated a marked improvement in handling context-window constraints, reducing input lengths from over 50,000 tokens to approximately 8,000 without sacrificing predictive accuracy.

Results and Implications

Applied across a suite of six prominent Repo-Code LLMs, HCP consistently outperformed baseline methods by substantially improving the accuracy of code completions. Specifically, these tests were conducted on the CrossCodeEval benchmark, where performance was measured using two metrics: Exact Match (EM) and Edit Similarity (ES). The authors reported a significant rise in completion accuracy across the board, establishing HCP as a reliable technique for optimizing real-world code-completion scenarios.
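
For reference, the two reported metrics can be approximated as follows: Exact Match checks whitespace-normalized string equality between the predicted and reference completion, while Edit Similarity is a character-level similarity score. The snippet below uses difflib's ratio as a stand-in for the Levenshtein-based similarity used by the benchmark, so treat it as an illustration rather than the benchmark's exact scoring code.

```python
# Rough sketch of the Exact Match (EM) and Edit Similarity (ES) metrics.
from difflib import SequenceMatcher


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()


def edit_similarity(prediction: str, reference: str) -> float:
    # difflib's ratio approximates a Levenshtein-style similarity in [0, 1].
    return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()


print(exact_match("return a + b", "return a + b"))               # True
print(round(edit_similarity("return a+b", "return a + b"), 2))   # ~0.91
```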

The implications of this research are significant for both theoretical and practical applications. Theoretically, it presents a robust framework for understanding how repository-level context can be optimized in LLMs without necessitating a proportionate increase in computational resources. Practically, this innovation stands to enhance the efficiency of tools such as GitHub Copilot and similar systems, making them more viable in processing large-scale codebases.

Future Directions

The research opens several avenues for future exploration. Extending HCP for languages other than Python, improving the granularity of dependency analysis, and possibly integrating machine-learned dependencies could further refine the strategy. Additionally, as software development practices evolve, further research might explore scaling these models for even denser repositories, informing both training techniques and real-time application concepts.

In summary, the introduction of Hierarchical Context Pruning represents a strategic advancement in managing the challenge of context limitations in code LLMs. By foregrounding a nuanced approach to context pruning and dependency management, this research provides effective measures for optimizing code completion tasks within repository-level data ecosystems.

Authors (10)
  1. Lei Zhang (1689 papers)
  2. Yunshui Li (18 papers)
  3. Jiaming Li (45 papers)
  4. Xiaobo Xia (43 papers)
  5. Jiaxi Yang (31 papers)
  6. Run Luo (22 papers)
  7. Minzheng Wang (9 papers)
  8. Longze Chen (16 papers)
  9. Junhao Liu (60 papers)
  10. Min Yang (239 papers)