Overview of "Long Code Arena: a Set of Benchmarks for Long-Context Code Models"
This essay analyzes the paper "Long Code Arena: a Set of Benchmarks for Long-Context Code Models" by Egor Bogomolov et al. The paper introduces Long Code Arena (LCA), a comprehensive benchmark suite designed to evaluate ML models on software engineering tasks that require long-context understanding. The work addresses a significant gap in the Machine Learning for Software Engineering (ML4SE) landscape, namely the lack of benchmarks for tasks that demand project-wide context.
Contributions
The Long Code Arena encompasses six distinct benchmarks:
- Library-Based Code Generation
- CI Builds Repair
- Project-Level Code Completion
- Commit Message Generation
- Bug Localization
- Module Summarization
Each benchmark targets a different facet of code processing, from generation and repair to completion and summarization, so that together they provide broad coverage for evaluating long-context code models.
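The datasets behind these benchmarks are distributed through HuggingFace, so working with the suite typically starts by loading a benchmark split and inspecting its fields. The snippet below is a minimal sketch under that assumption; the dataset identifier shown is hypothetical, and the exact names should be taken from the official Long Code Arena resources.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical dataset identifier: the exact HuggingFace repository names for the
# LCA benchmarks are an assumption here; consult the official Long Code Arena
# page for the real identifiers and available splits.
DATASET_ID = "JetBrains-Research/lca-commit-message-generation"

dataset = load_dataset(DATASET_ID, split="test")

# Inspect one sample to see which fields (e.g., diff, reference message,
# repository metadata) are available for building prompts and computing metrics.
print(dataset[0].keys())
```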
Methodology
The datasets used in Long Code Arena are derived from a common corpus of open-source GitHub repositories that satisfy strict quality criteria (e.g., star count, issue activity, number of contributors). Data collection involves several levels of filtering and manual verification to ensure high data quality and relevance. For example, the CI builds repair dataset is built from CI logs of GitHub Actions workflows, while the Commit Message Generation (CMG) dataset refines the CommitChronicle dataset, retaining larger commits with comprehensive, meaningful descriptions.
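As a rough illustration of this repository-level filtering, the sketch below screens repository metadata against quality thresholds. The field names, thresholds, and license list are assumptions chosen for illustration, not the exact criteria used by the authors.

```python
# Minimal sketch of repository-level quality filtering. The thresholds and
# metadata fields below are illustrative assumptions, not the paper's criteria.

def passes_quality_filter(repo: dict,
                          min_stars: int = 50,
                          min_contributors: int = 5,
                          min_issues: int = 10,
                          allowed_licenses=("mit", "apache-2.0", "bsd-3-clause")) -> bool:
    """Return True if a repository's metadata meets the (hypothetical) quality bar."""
    issues = repo.get("open_issues_count", 0) + repo.get("closed_issues_count", 0)
    return (
        repo.get("stargazers_count", 0) >= min_stars
        and repo.get("contributors_count", 0) >= min_contributors
        and issues >= min_issues
        and repo.get("license", "").lower() in allowed_licenses
    )

# Example: filter repository metadata records (e.g., fetched from the GitHub API)
# down to candidates for dataset construction.
repos = [
    {"name": "org/lib-a", "stargazers_count": 320, "contributors_count": 12,
     "closed_issues_count": 40, "license": "MIT"},
    {"name": "org/toy-b", "stargazers_count": 3, "contributors_count": 1,
     "closed_issues_count": 0, "license": "MIT"},
]
candidates = [r for r in repos if passes_quality_filter(r)]
print([r["name"] for r in candidates])  # -> ['org/lib-a']
```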
Evaluation Metrics
Each task leverages specific metrics suited to its nature:
- Library-Based Code Generation: Evaluated with ChrF and API Recall.
- CI Builds Repair: Assessed by the pass rate of CI builds after the model's fix is applied.
- Project-Level Code Completion: Measured by exact match of the generated lines, reported separately for predefined categories of target lines.
- Commit Message Generation: Evaluated with BLEU, ROUGE, ChrF, and BERTScore.
- Bug Localization: Evaluated with standard information-retrieval metrics such as Recall@k, Precision@k, F1 score, and Mean Average Precision (MAP); a short sketch of these metrics follows this list.
- Module Summarization: Evaluated with CompScore, a novel metric that uses an LLM as a scalable proxy for human judgment.
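For concreteness, here are textbook implementations of the retrieval metrics used for Bug Localization. They follow the standard information-retrieval definitions rather than the LCA evaluation code itself.

```python
from typing import List, Set

def recall_at_k(ranked_files: List[str], buggy_files: Set[str], k: int) -> float:
    """Fraction of the truly buggy files found within the top-k ranked files."""
    hits = sum(1 for f in ranked_files[:k] if f in buggy_files)
    return hits / len(buggy_files) if buggy_files else 0.0

def precision_at_k(ranked_files: List[str], buggy_files: Set[str], k: int) -> float:
    """Fraction of the top-k ranked files that are truly buggy."""
    hits = sum(1 for f in ranked_files[:k] if f in buggy_files)
    return hits / k if k else 0.0

def average_precision(ranked_files: List[str], buggy_files: Set[str]) -> float:
    """Average of precision at each rank where a buggy file is retrieved."""
    hits, total = 0, 0.0
    for rank, f in enumerate(ranked_files, start=1):
        if f in buggy_files:
            hits += 1
            total += hits / rank
    return total / len(buggy_files) if buggy_files else 0.0

# Example: a model ranks repository files by suspected relevance to a bug report.
ranking = ["src/app.py", "src/utils.py", "src/db.py", "tests/test_app.py"]
ground_truth = {"src/db.py", "src/utils.py"}
print(recall_at_k(ranking, ground_truth, k=2))     # 0.5
print(precision_at_k(ranking, ground_truth, k=2))  # 0.5
print(average_precision(ranking, ground_truth))    # (1/2 + 2/3) / 2 ≈ 0.58
```

Mean Average Precision is then simply the mean of average_precision over all bug reports in the evaluation set.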
Results and Discussion
The results across benchmarks reveal substantial variability in model performance. GPT-4 demonstrated the strongest results on most tasks, clearly outperforming open models such as CodeLlama and Mistral. For the Library-Based Code Generation task, GPT-4 achieved an API Recall of 37%, while for CI Builds Repair it repaired 17% of the samples correctly.
In Project-Level Code Completion, context composition strategies notably affected performance: models such as CodeLlama-7B showed a marked improvement in exact match when relevant repository context was included in the prompt. For CMG, proprietary models such as GPT-4 excelled, achieving a ChrF of 34.4, while Mixtral-8x7B was the best open-source model with a ChrF of 32. For Module Summarization, the new CompScore metric provided a more nuanced evaluation, with GPT-4 obtaining the highest CompScore of 57.3.
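To make the idea of context composition concrete, the sketch below shows one simple strategy: prepending related repository files to the prefix of the target file under a size budget. The file-selection heuristic and the budget are illustrative assumptions, not the specific composition strategies benchmarked in the paper.

```python
from typing import Dict

def compose_completion_prompt(target_prefix: str,
                              repo_files: Dict[str, str],
                              budget_chars: int = 8_000) -> str:
    """Concatenate repository context followed by the target file prefix,
    keeping the total length under a simple character budget."""
    parts = []
    remaining = budget_chars - len(target_prefix)
    for path, content in repo_files.items():  # assumed pre-ordered by relevance
        snippet = f"# file: {path}\n{content}\n"
        if len(snippet) > remaining:
            break
        parts.append(snippet)
        remaining -= len(snippet)
    parts.append(target_prefix)
    return "".join(parts)

# Usage: the resulting prompt is fed to the code model, and its next generated
# line is compared against the ground-truth line with exact match.
prompt = compose_completion_prompt(
    target_prefix="def load_config(path):\n    ",
    repo_files={"utils/io.py": "def read_yaml(path):\n    ...\n"},
)
```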
Implications and Future Directions
The practical and theoretical implications of Long Code Arena are multifaceted. Practically, it provides a standardized suite for comprehensively evaluating and comparing code models on various ML4SE tasks, fostering a deeper understanding of the capabilities of long-context models and guiding future research and development. Theoretically, it sheds light on the complexities and challenges of processing extended, project-wide context with ML models.
Future developments could explore extending datasets to cover additional programming languages and refining benchmark tasks as models evolve. There is also potential for leveraging these datasets in fine-tuning models specifically for long-context comprehension, further pushing the boundaries of what ML models can achieve in software engineering.
Conclusion
Long Code Arena is a significant addition to the ML4SE domain, advancing how ML models process and understand long contexts in software projects. The benchmarks capture realistic, repository-scale scenarios, and the careful dataset construction ensures high data quality and relevance. These contributions position LCA as a valuable tool for researchers and practitioners alike, driving forward the capabilities and applications of long-context code models.