CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
The paper surveys the landscape of code completion evaluation and identifies a significant gap: existing benchmarks rely almost exclusively on within-file context, and therefore give a limited picture of a model's real-world capability. The authors introduce CrossCodeEval, a benchmark designed to evaluate how well code LLMs comprehend and use cross-file context. It is premised on the observation that real-world coding projects span multiple files with complex interdependencies, which benchmarks like HumanEval and MBPP do not capture.
Key Contributions
- Novel Benchmark: CrossCodeEval is a multilingual, cross-file code completion benchmark built from real-world, open-source repositories in Python, Java, TypeScript, and C#. The authors use a static-analysis-based construction pipeline so that every example strictly requires cross-file context: the ground-truth completion uses symbols defined in other files of the same repository (see the sketch after this list).
- Dataset Construction: CrossCodeEval contains roughly 10,000 examples drawn from about 1,000 repositories. The repositories are recently created and filtered to minimize overlap with the training data of existing large models, and the dataset spans four widely used languages, giving broad coverage for evaluating code models.
- Comprehensive Evaluation: The authors evaluate contemporary models such as CodeGen, StarCoder, and GPT-3.5-Turbo under several prompting settings, showing substantial performance gains when cross-file context is included in the prompt. Even the best-performing configurations remain far from perfect, indicating significant headroom for future model improvements.
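To make the construction criterion concrete, here is a minimal Python sketch of the kind of static check involved. It keeps an example only if the ground-truth line uses a name imported from a module inside the same repository. The `project_modules` set and the single-line ground truth are assumptions of this sketch, and the authors' actual pipeline covers all four languages, not just Python.

```python
import ast
import re

def intra_project_imports(file_source: str, project_modules: set[str]) -> set[str]:
    """Names imported from modules defined elsewhere in the same repository.

    `project_modules` (the repo's own top-level module names) is an assumption
    of this sketch, not an artifact shipped with the benchmark.
    """
    names: set[str] = set()
    for node in ast.walk(ast.parse(file_source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in project_modules:
                names.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for a in node.names:
                if a.name.split(".")[0] in project_modules:
                    names.add(a.asname or a.name.split(".")[0])
    return names

def requires_cross_file_context(file_source: str, ground_truth_line: str,
                                project_modules: set[str]) -> bool:
    """Keep an example only if its ground-truth completion references an
    intra-project symbol, i.e. cross-file context is genuinely needed."""
    imported = intra_project_imports(file_source, project_modules)
    used = set(re.findall(r"[A-Za-z_]\w*", ground_truth_line))
    return bool(imported & used)
```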
Numerical Insights
The experiments show that models perform poorly when cross-file context is withheld, confirming that these examples cannot be solved from in-file context alone and that benchmarks ignoring cross-file dependencies are incomplete. With retrieved cross-file context added to the prompt, StarCoder (15.5B) improved code exact match by up to roughly 4.5x, demonstrating both the necessity and the effectiveness of incorporating cross-file context in the benchmark.
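As a rough illustration of this evaluation setup, the sketch below assembles a prompt by prepending retrieved cross-file snippets as comment blocks and scores a prediction with exact match plus an edit-similarity proxy. The comment-block prompt format and the use of difflib in place of a Levenshtein-based similarity are simplifying assumptions, not the paper's exact implementation.

```python
import difflib

def build_prompt(in_file_prefix: str, retrieved_chunks: list[tuple[str, str]]) -> str:
    """Prepend retrieved cross-file snippets (as comments) to the in-file prefix.

    `retrieved_chunks` is a list of (file_path, code) pairs; rendering them as
    comment blocks is one plausible format, not necessarily the paper's own.
    """
    blocks = []
    for path, code in retrieved_chunks:
        commented = "\n".join("# " + line for line in code.splitlines())
        blocks.append(f"# Context from {path}:\n{commented}")
    return "\n".join(blocks) + "\n" + in_file_prefix

def code_match(prediction: str, reference: str) -> dict[str, float]:
    """Exact match plus an edit-similarity proxy (difflib's ratio stands in
    for the Levenshtein-based similarity reported in the paper)."""
    pred, ref = prediction.strip(), reference.strip()
    return {
        "exact_match": float(pred == ref),
        "edit_similarity": difflib.SequenceMatcher(None, pred, ref).ratio(),
    }
```

In this framing, the "without cross-file context" setting simply passes an empty `retrieved_chunks` list, which is where models degrade sharply.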
Implications and Future Directions
Practical Implications: The introduction of CrossCodeEval marks a significant step towards more realistic evaluation frameworks for code LLMs. It reflects real software development environments, where models must interpret and integrate information spread across multiple files.
Theoretical Implications: Because CrossCodeEval strictly requires cross-file understanding, it is a natural testbed for retrieval-augmented generation. It also invites refinement of cross-file dependency analysis and retrieval strategies, and motivates more careful training data curation to avoid memorization and contamination in future models.
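As a concrete starting point for such retrieval experiments, the sketch below ranks fixed-size chunks of the repository's other files by lexical overlap with the in-file prefix. Jaccard similarity and the `chunk_file`/`retrieve` helpers are stand-ins chosen for simplicity; the paper's experiments use stronger retrievers such as BM25 and dense embeddings.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased identifier-like tokens, used as a crude lexical signature."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

def chunk_file(code: str, window: int = 10) -> list[str]:
    """Split a file into fixed-size line windows; real systems often prefer
    sliding windows or syntax-aware chunking."""
    lines = code.splitlines()
    return ["\n".join(lines[i:i + window]) for i in range(0, len(lines), window)]

def retrieve(query: str, repo_files: dict[str, str], top_k: int = 3) -> list[tuple[str, str]]:
    """Rank cross-file chunks by Jaccard overlap with the in-file prefix."""
    query_toks = _tokens(query)
    scored = []
    for path, code in repo_files.items():
        for chunk in chunk_file(code):
            toks = _tokens(chunk)
            score = len(query_toks & toks) / (len(query_toks | toks) or 1)
            scored.append((score, path, chunk))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(path, chunk) for _, path, chunk in scored[:top_k]]
```

The output of `retrieve` can be fed directly into a prompt-building step like the one sketched above, which makes it easy to compare retrievers under a fixed completion model.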
Next Steps in AI: This research paves the way for better methods of retrieving and using context from large codebases, in line with the broader push toward models with stronger code understanding and generation capabilities. Improvements in retrieval mechanisms and in how models consume retrieved context hold promise for significant advances in AI-assisted software development.
In conclusion, CrossCodeEval sets a new standard for evaluating code LLMs by accounting for cross-file dependencies, providing a more accurate reflection of real-world usage. The work both exposes a significant gap in current evaluation methodology and points to promising directions for building more robust AI systems for code completion.