
RepoQA: Evaluating Long Context Code Understanding (2406.06025v1)

Published 10 Jun 2024 in cs.SE, cs.CL, and cs.LG

Abstract: Recent advances have been improving the context windows of LLMs. To quantify the real long-context capabilities of LLMs, evaluators such as the popular Needle in a Haystack have been developed to test LLMs over a large chunk of raw texts. While effective, current evaluations overlook the insight of how LLMs work with long-context code, i.e., repositories. To this end, we initiate the RepoQA benchmark to evaluate LLMs on long-context code understanding. Traditional needle testers ask LLMs to directly retrieve the answer from the context without necessary deep understanding. In RepoQA, we built our initial task, namely Searching Needle Function (SNF), which exercises LLMs to search functions given their natural-language description, i.e., LLMs cannot find the desired function if they cannot understand the description and code. RepoQA is multilingual and comprehensive: it includes 500 code search tasks gathered from 50 popular repositories across 5 modern programming languages. By evaluating 26 general and code-specific LLMs on RepoQA, we show (i) there is still a small gap between the best open and proprietary models; (ii) different models are good at different languages; and (iii) models may understand code better without comments.

Evaluating Long-Context Code Understanding with RepoQA

The paper "RepoQA: Evaluating Long Context Code Understanding" introduces a novel benchmark tailored specifically for assessing the long-context code comprehension abilities of LLMs. As the computing community continues to expand the capabilities of LLMs by increasing their context window sizes, the need for thorough evaluation methods becomes increasingly crucial. This paper addresses a critical gap in this field by proposing the RepoQA benchmark geared towards the code domain, diverging from typical benchmarks that concentrate on generic text or synthetic environments.

Benchmark Design

RepoQA provides a structured evaluation focused on the real-world applicability of long-context LLMs for code understanding. The benchmark encompasses 500 code search tasks derived from 50 popular open-source repositories, representing five programming languages: Python, C++, Java, TypeScript, and Rust. The initial task in RepoQA, termed "Searching Needle Function" (SNF), challenges models to locate specific functions based on natural language descriptions. Unlike traditional needle benchmarks, SNF demands an actual understanding of both the function description and the surrounding code, instead of mere retrieval skills.
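
Concretely, each SNF item pairs a long, multi-file code context with the natural-language description of one "needle" function that the model must locate and reproduce. The snippet below is a minimal, hypothetical sketch of how such an item might be posed and scored; the prompt wording, the difflib-based similarity, and the 0.8 threshold are illustrative assumptions, not the paper's actual harness.

  from difflib import SequenceMatcher

  def build_snf_prompt(repo_context: str, description: str) -> str:
      """Assemble the long-context prompt for one Searching Needle Function item."""
      return (
          "Below are source files from a repository:\n\n"
          f"{repo_context}\n\n"
          "Output, verbatim, the single function that matches this description:\n"
          f"{description}\n"
      )

  def is_correct(model_output: str, needle_function: str, threshold: float = 0.8) -> bool:
      """Score the answer by fuzzy similarity to the ground-truth needle function."""
      ratio = SequenceMatcher(None, model_output.strip(), needle_function.strip()).ratio()
      return ratio >= threshold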

Key contributions of RepoQA include:

  • Being the inaugural benchmark for long-context code understanding.
  • Introducing an automated pipeline to generate evaluation tasks for SNF (a rough sketch of such a pipeline follows this list).
  • Offering a multilingual and comprehensive dataset for code retrieval tasks.
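
As a rough, Python-only illustration of how candidate needle functions might be mined from a repository before natural-language descriptions are attached, the sketch below uses the standard ast module. The helper name, the ten-line size filter, and the single-language scope are assumptions; the paper's pipeline is multilingual and uses its own tooling.

  import ast
  from pathlib import Path

  def extract_needle_candidates(repo_root: str, min_lines: int = 10) -> list[dict]:
      """Collect reasonably sized Python functions as candidate 'needles'."""
      candidates = []
      for path in Path(repo_root).rglob("*.py"):
          source = path.read_text(encoding="utf-8", errors="ignore")
          try:
              tree = ast.parse(source)
          except SyntaxError:
              continue  # skip files that do not parse
          for node in ast.walk(tree):
              if isinstance(node, ast.FunctionDef):
                  code = ast.get_source_segment(source, node) or ""
                  if len(code.splitlines()) >= min_lines:
                      candidates.append({"file": str(path), "name": node.name, "code": code})
      return candidates

Each candidate could then be paired with a natural-language description (for example, one drafted by an LLM and reviewed by a human) to serve as the search query for an SNF task.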

Evaluation and Findings

The RepoQA benchmark was applied to 26 general and code-specific LLMs, evaluating their proficiency in long-context code understanding. The results highlight several insights:

  1. Model Performance: A small but measurable gap remains between the best open-source and proprietary models. State-of-the-art proprietary models led overall, though top open-source models such as DeepSeek-V2-Chat performed commendably.
  2. Language-Specific Proficiency: Accuracy varied by programming language, with Java and TypeScript scoring higher. This variation likely reflects differences in training data, with each model showing its own language strengths and weaknesses.
  3. Impact of Comments: Surprisingly, removing comments often improved retrieval accuracy, suggesting models may understand raw code without relying on the linguistic cues that comments provide; a minimal sketch of such a comment-stripping ablation follows this list.
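
As a rough, Python-only illustration of a comment-removal ablation (the benchmark itself spans five languages), '#' comments can be stripped with the standard tokenize module before the context is assembled. This helper is a hypothetical sketch, not the paper's preprocessing code.

  import io
  import tokenize

  def strip_hash_comments(source: str) -> str:
      """Remove '#' comments from Python source; docstrings are left in place."""
      tokens = [
          tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.type != tokenize.COMMENT
      ]
      return tokenize.untokenize(tokens)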

Implications and Future Work

RepoQA demonstrates both the feasibility of and the need for benchmarks that reflect real-world software development tasks. It provides a robust platform to assess and compare the long-context capabilities of LLMs in code understanding. As LLMs evolve, understanding their strengths and limits in practical applications becomes crucial for their effective integration into software engineering processes.

Looking forward, we can anticipate the expansion of RepoQA to encompass more complex and challenging tasks that better simulate real-life programming environments and demands. This includes task types that require deeper insights into inter-file and cross-project code dependencies. Additionally, extending the benchmark to cover more languages and a greater diversity of repositories will provide a richer landscape for evaluating advanced LLMs.

Ultimately, RepoQA is positioned as a significant step toward refining the evaluation tools available for long-context LLMs, driving the community toward a broader, more nuanced understanding of these models' true capabilities and potential applications in software development.

Authors (8)
  1. Jiawei Liu (156 papers)
  2. Jia Le Tian (2 papers)
  3. Vijay Daita (1 paper)
  4. Yuxiang Wei (40 papers)
  5. Yifeng Ding (22 papers)
  6. Yuhan Katherine Wang (1 paper)
  7. Jun Yang (357 papers)
  8. Lingming Zhang (48 papers)