Evaluating Multilingual Code Completion with Mrc-Eval
The paper presents an in-depth study of repository-level code completion across a wide range of programming languages, using a newly developed benchmark, Mrc-Eval. It responds to current limitations in evaluating the multilingual capabilities of code LLMs by introducing a benchmark that covers 18 programming languages, a significant improvement over existing benchmarks that target fewer languages and provide limited evaluation metrics.
Key Contributions
- Multilingual Benchmark: Mrc-Eval is notable for its expansive language coverage. It includes 18 diverse programming languages, giving researchers a more comprehensive evaluation framework for code LLMs.
- Fine-Grained Annotations: Two levels of annotation, bucket-level and semantic-level, are derived from each file's abstract syntax tree (AST). These annotations provide insight into the completion capabilities of code LLMs across different contexts and complexities; a sketch of depth-based bucketing follows this list.
- Supplementary Dataset: The creation of Mrc-Instruct, a multilingual instruction corpus, supports the enhancement of repository-level code completion models.
- Comprehensive Evaluation: The experiments demonstrate the efficacy of Mrc-Eval in gauging the abilities of popular code LLMs like StarCoder, DeepSeekCoder, and Code Llama across various evaluation metrics.
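To make the bucket-level idea concrete, the following minimal sketch assigns each AST node to a coarse depth bucket. It uses Python's built-in `ast` module, and the `bucket_label` helper and bucket count are illustrative assumptions, not the paper's implementation; the benchmark itself would need a multilingual parser and its own bucketing scheme.

```python
# Illustrative sketch only: bucket AST nodes by their depth in the tree.
# Uses Python's standard `ast` module; a multilingual benchmark would
# instead rely on a language-agnostic parser.
import ast

def node_depths(tree):
    """Yield (node, depth) pairs for every node in the AST."""
    stack = [(tree, 0)]
    while stack:
        node, depth = stack.pop()
        yield node, depth
        for child in ast.iter_child_nodes(node):
            stack.append((child, depth + 1))

def bucket_label(depth, max_depth, num_buckets=10):
    """Map an AST depth onto one of `num_buckets` coarse buckets."""
    if max_depth == 0:
        return 0
    return min(num_buckets - 1, depth * num_buckets // (max_depth + 1))

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)
depths = list(node_depths(tree))
max_depth = max(d for _, d in depths)
for node, depth in depths:
    print(type(node).__name__, depth, bucket_label(depth, max_depth))
```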
Results and Observations
The experimental results highlight the advantage of incorporating cross-file contexts in code completion tasks. Utilizing supplementary contexts significantly improves model performance, underscoring the importance of repository-level understanding in LLMs. Notably, fine-tuning with the Mrc-Instruct dataset achieves considerable improvements, indicating that specific instruction tuning can enhance cross-language code completion capabilities.
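As a rough illustration of what incorporating cross-file context can mean in practice, the sketch below prepends retrieved snippets from other repository files to the in-file prefix under a simple character budget. The `Snippet` type, `build_prompt` function, and budget value are assumptions for exposition, not the paper's pipeline.

```python
# Assumed sketch: assemble a repository-level completion prompt by
# prepending cross-file snippets to the current file's prefix.
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    text: str

def build_prompt(cross_file, in_file_prefix, budget_chars=4000):
    """Concatenate cross-file context and the in-file prefix,
    truncating the cross-file portion to fit a character budget."""
    parts = []
    remaining = budget_chars - len(in_file_prefix)
    for snip in cross_file:
        piece = f"# File: {snip.path}\n{snip.text}\n"
        if len(piece) > remaining:
            break
        parts.append(piece)
        remaining -= len(piece)
    return "".join(parts) + in_file_prefix

prompt = build_prompt(
    [Snippet("utils/math_ops.py", "def clamp(x, lo, hi):\n    ...\n")],
    "from utils.math_ops import clamp\n\ndef normalize(x):\n    return cl",
)
print(prompt)
```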
Additionally, the paper reveals differing performance across programming languages and annotation types, suggesting that language-specific syntactic constructs influence model efficacy. The bucket-level results indicate that performance declines as the completed nodes sit at shallower depths of the AST, whereas the semantic-level annotations reveal strong performance on identifiers but weaknesses on specialized language structures.
Implications and Future Directions
The research contributes fundamentally to the field of multilingual code intelligence by providing tools and resources to better evaluate and develop code LLMs. The findings suggest promising directions for future research, such as the development of models that are better suited to handle syntactic and semantic nuances across diverse programming languages.
The introduction of detailed annotations opens pathways to more granular insight into model performance, offering opportunities to further refine model architectures and training regimens. The reliance on textual metrics, exact match (EM) and edit similarity (ES), while indicative, also points to the need for execution-based evaluation to capture true semantic equivalence.
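For reference, the two textual metrics can be computed as in the minimal sketch below. The edit-similarity implementation here uses Python's `difflib` ratio as a stand-in; benchmarks often use a Levenshtein-based ratio, so treat this as an approximation rather than the paper's exact scorer.

```python
# Minimal sketch of the two textual metrics discussed above.
import difflib

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())

def edit_similarity(prediction: str, reference: str) -> float:
    """ES (approximate): similarity ratio between the two strings."""
    return difflib.SequenceMatcher(
        None, prediction.strip(), reference.strip()
    ).ratio()

print(exact_match("return a + b", "return a + b"))      # 1.0
print(edit_similarity("return a - b", "return a + b"))  # approx. 0.92
```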
In conclusion, Mrc-Eval represents a substantial step forward in assessing the code completion capabilities of LLMs, promoting advances in code completion technology and software automation across multilingual environments. As the field progresses, such comprehensive evaluations will be crucial in aligning LLM development with practical software engineering needs.