YABLoCo: Yet Another Benchmark for Long Context Code Generation
The paper "YABLoCo: Yet Another Benchmark for Long Context Code Generation" offers a novel contribution to the field of code generation by addressing a gap in existing benchmarks. While many benchmarks evaluate the performance of LLMs on code generation tasks with relatively small context windows, they often do not extend to the larger, more complex real-world software repositories that can span millions of lines of code (LoC). This paper presents YABLoCo, a benchmark specifically designed to evaluate code generation in such extensive codebases, focusing on C and C++ languages, which are underrepresented in current benchmarks.
Key Contributions
- Dataset and Benchmark Design: The paper introduces a dataset of 215 functions drawn from four large repositories ranging from 200K to 2,000K LoC. Each entry includes not only a function's metadata and body but also the call graph capturing its dependencies on other functions in the repository (a hypothetical entry schema is sketched after this list).
- Expanded Context Window: YABLoCo emphasizes long-context code generation, where the context needed to generate a function may include dependencies in the same file, across multiple files, or spanning the entire project.
- Evaluation Pipeline: The authors provide an evaluation pipeline that efficiently computes pass@k alongside syntactic similarity measures, and that is designed to support scalable testing of LLMs against the benchmark dataset.
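To make the dataset design concrete, here is what a single benchmark entry might look like. All field names and values below are illustrative assumptions based on the paper's description (metadata, body, docstring, call-graph edges, dependency level), not the dataset's actual schema.

```python
# Hypothetical YABLoCo-style benchmark entry; the real schema may differ.
example_entry = {
    "repo": "example-project",          # one of the four C/C++ repositories
    "function_name": "buf_append",
    "signature": "int buf_append(struct buf *b, const char *s, size_t n)",
    "docstring": "/* Append n bytes of s to buffer b, growing it as needed. */",
    "body": "...",                      # ground-truth implementation, held out at generation time
    "callees": ["buf_grow", "memcpy"],  # outgoing call-graph edges
    "dependency_level": "project",      # one of five levels, from none to project-wide
    "tests": ["tests/test_buf.c"],      # tests used for functional scoring
}
```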
Methodology
The framework employs a clang-based tool to build and analyze function call graphs, and it categorizes each function's dependencies into five levels ranging from none to project-wide. A detailed filtering process then selects functions based on docstring quality, uniqueness, and test coverage. Functions with significant dependencies pose challenging scenarios for LLM-based code generation models, pushing their limits in extracting and utilizing relevant context.
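As a rough illustration of what clang-based call-graph extraction involves, the sketch below collects (caller, callee) edges from a single C translation unit using libclang's Python bindings. This is a minimal approximation under stated assumptions, not the authors' tool: the source path and compile flags are hypothetical, and a real pipeline would additionally link edges across translation units to build the repository-wide graph.

```python
import clang.cindex as ci  # pip install libclang

def extract_call_edges(source_path: str, args=("-std=c11",)):
    """Return (caller, callee) name pairs found in one translation unit."""
    index = ci.Index.create()
    tu = index.parse(source_path, args=list(args))
    edges = []

    def visit(node, current_fn=None):
        # Remember which function definition we are currently inside.
        if node.kind == ci.CursorKind.FUNCTION_DECL and node.is_definition():
            current_fn = node.spelling
        # Record an edge whenever a call expression resolves to its callee.
        if node.kind == ci.CursorKind.CALL_EXPR and current_fn:
            callee = node.referenced
            if callee is not None:
                edges.append((current_fn, callee.spelling))
        for child in node.get_children():
            visit(child, current_fn)

    visit(tu.cursor)
    return edges

# Example (hypothetical path): edges = extract_call_edges("src/buf.c")
```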
Evaluation
The research evaluates several LLMs for code generation, including CodeLlama-13B, DeepSeekCoder-33B, and GPT-4, both with and without contextual information. Baseline performance is established on the YABLoCo benchmark, highlighting how results change as models are given varying levels of context.
- Pass@10: This metric estimates the probability that at least one of ten sampled completions passes the existing test cases from the selected repositories (a standard estimator is sketched after this list).
- Comparison with Context: Significant improvements in pass@k scores were reported when LLMs were supplied with an 'oracle' context, illustrating the potential gains when models are equipped with the right contextual data; a hedged prompt-assembly sketch also follows below.
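For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021), pass@k = 1 - C(n-c, k)/C(n, k), where n completions are sampled per function and c of them pass. A minimal implementation, assuming YABLoCo follows this standard definition:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable product.
    n: samples generated, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per function, 3 passing -> pass@10 ≈ 0.895
print(pass_at_k(20, 3, 10))
```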
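The oracle setting can be pictured as prepending a function's ground-truth dependencies to the prompt. A minimal sketch, assuming the context consists of callee bodies taken from the call graph (the paper's exact prompt format is not reproduced here), reusing the hypothetical entry schema above:

```python
def build_prompt(entry: dict, context_bodies: list[str] | None = None) -> str:
    """Assemble a generation prompt from a benchmark entry.
    context_bodies: callee implementations supplied in the oracle setting;
    omit for the no-context baseline. Purely illustrative."""
    parts = []
    if context_bodies:
        parts.append("// Relevant definitions from the repository:")
        parts.extend(context_bodies)
    parts.append(entry["docstring"])
    parts.append(entry["signature"] + " {")  # the model completes the body
    return "\n".join(parts)
```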
Discussion and Future Directions
YABLoCo demonstrates its value by revealing the limitations of existing LLMs when faced with the complexities of long-context code generation. Its introduction fosters deeper exploration of context utilization, retrieval-augmented generation, and code dependency resolution, and the benchmark's difficulty encourages the development of models that can better comprehend and generate code amid intricate dependencies.
A notable insight from the experiments is the discrepancy in LLM performance across repositories, potentially attributable to dataset characteristics and pretraining exposure; this underscores the need for further augmentation and refinement of LLM training data.
In conclusion, YABLoCo serves as a valuable tool for advancing code generation by quantifying the impact of context and challenging existing LLMs. Future work may explore hybrid approaches that combine retrieval mechanisms with generative capabilities to improve performance. Expanding YABLoCo to additional programming languages and more diverse codebases would also give a fuller picture of LLM capabilities in software engineering tasks.