LoCoBench: Long-Context LLM Evaluation

Updated 30 September 2025
  • LoCoBench is a benchmark suite for evaluating long-context LLMs, featuring realistic multi-file scenarios and a multidimensional metric framework.
  • It employs a systematic five-phase pipeline to generate synthetic codebases in 10 programming languages, spanning tasks from architectural reasoning to bug investigation.
  • Empirical findings reveal systematic performance degradation with increased context size and underscore challenges in architectural coherence and multi-session memory retention.

LoCoBench is a benchmark suite specifically designed for the evaluation of long-context LLMs in complex software engineering scenarios, addressing the critical gap left by short-context code evaluation tasks. Unlike previous benchmarks such as HumanEval or SWE-Bench, LoCoBench pushes the evaluation frontier toward whole-codebase understanding, multi-file reasoning, and system-level architectural coherence. The benchmark comprises 8,000 systematically generated evaluation scenarios spanning 10 programming languages, with context lengths ranging from 10,000 up to 1,000,000 tokens—a range that enables precise characterization of performance degradation as context increases. LoCoBench introduces eight task categories that encapsulate essential long-context capabilities and provides a multidimensional evaluation framework, combining 17 metrics (including eight novel ones) into a unified LoCoBench Score (LCBS) to facilitate rigorous comparative analysis.

1. Benchmark Architecture and Context Scope

LoCoBench's design leverages a five-phase pipeline to ensure diversity, scalability, and realism (a minimal orchestration sketch follows the phase list):

  • Generation of Project Specifications: 1,000 synthetic project specifications are produced, each with requirements that resemble realistic software engineering workflows, including multi-file architecture and documentation structures.
  • Synthetic Codebase Construction: Using the generated specifications, complete synthetic codebases are instantiated in 10 programming languages, with folder structures, dependency graphs, and cross-module linkages.
  • Scenario Engineering: Each codebase is transformed into eight scenarios covering distinct long-context reasoning tasks (see Section 2), stratified into four difficulty levels ("easy" to "expert") that correlate with increasing context lengths: 10K, 100K, 500K, and 1M tokens.
  • Metric Annotation and Ground Truth Generation: Scenario-specific evaluation criteria are constructed, including ground truth outputs and expected codebase transformations.
  • Scenario Quality Control and Calibration: Final scenarios undergo quality control, calibration for context window coverage, and statistical analysis to balance difficulty and diversity.
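
The sketch below illustrates how such a pipeline might be orchestrated end to end. It is a minimal, hypothetical rendering: the helper callables (`build_codebase`, `annotate`, `calibrate`), the middle difficulty names, and the stratification logic are illustrative assumptions, not the actual LoCoBench implementation.

```python
from dataclasses import dataclass, field

# The eight task categories described in Section 2.
TASK_CATEGORIES = [
    "architectural_understanding", "cross_file_refactoring", "feature_implementation",
    "bug_investigation", "multi_session_development", "code_comprehension",
    "integration_testing", "security_analysis",
]

# Assumed difficulty-to-context mapping; only "easy", "expert", and the
# 10K/100K/500K/1M token targets are stated in the text, the middle names are guesses.
DIFFICULTY_CONTEXT = {"easy": 10_000, "medium": 100_000, "hard": 500_000, "expert": 1_000_000}

@dataclass
class Scenario:
    project_id: str
    task_category: str
    difficulty: str
    context_tokens: int
    ground_truth: dict = field(default_factory=dict)

def run_pipeline(project_specs, build_codebase, annotate, calibrate):
    """Hypothetical driver for the five phases: specs -> codebases -> scenarios -> ground truth -> QC."""
    scenarios = []
    for spec in project_specs:                                    # Phase 1 output: project specifications
        codebase = build_codebase(spec)                           # Phase 2: synthetic codebase
        for i, task in enumerate(TASK_CATEGORIES):                # Phase 3: eight scenarios per codebase
            difficulty = list(DIFFICULTY_CONTEXT)[i % 4]          # placeholder difficulty stratification
            scenario = Scenario(spec["id"], task, difficulty,
                                DIFFICULTY_CONTEXT[difficulty])
            scenario.ground_truth = annotate(codebase, scenario)  # Phase 4: evaluation criteria
            scenarios.append(scenario)
    return calibrate(scenarios)                                   # Phase 5: quality control and calibration
```

With 1,000 project specifications, this loop yields the 8,000 scenarios described above (1,000 codebases × 8 task categories).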

This 100× variation in context length enables the study of long-context scaling properties, performance degradation, and cross-language generalization.

2. Task Categories and Scenario Taxonomy

Each LoCoBench codebase yields eight distinct scenario types, covering the main cognitive demands of large-scale software engineering:

| Task Category | Core Challenge | Example Scenario |
| --- | --- | --- |
| Architectural Understanding | Global system reasoning | Classify or modify the architecture pattern in a large project |
| Cross-File Refactoring | Inter-module consistency | Update APIs across multiple files and dependencies |
| Feature Implementation | Integration of new logic | Add new modules or functionality with proper system hooks |
| Bug Investigation | Multi-file debugging | Trace and fix a root cause spanning several files |
| Multi-Session Development | Stateful memory across sessions | Continue development with cumulative code changes |
| Code Comprehension | High-level summarization | Produce module summaries given documentation and source |
| Integration Testing | System-wide validation | Design and execute cross-module integration tests |
| Security Analysis | Vulnerability identification | Audit for potential issues across files and dependencies |

Each scenario within a category emphasizes specific long-context reasoning requirements, e.g., architectural coherence or cross-file dependency traversal.
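
As a concrete illustration of what one such scenario might contain, the following is a hypothetical payload for a Bug Investigation task; the field names, file paths, and instruction text are invented for illustration and do not reflect the actual benchmark schema.

```python
# Hypothetical scenario record for one Bug Investigation task (illustrative only).
bug_investigation_scenario = {
    "task_category": "bug_investigation",
    "difficulty": "hard",
    "context_files": {                       # multi-file context supplied to the model
        "src/payments/gateway.py": "...",    # file contents elided
        "src/payments/retry.py": "...",
        "tests/test_gateway.py": "...",
    },
    "instruction": (
        "Trace the root cause of the failing retry test and propose a fix "
        "that preserves the public API of gateway.py."
    ),
    "ground_truth": {
        "root_cause_files": ["src/payments/retry.py"],
        "expected_patch": "...",             # reference diff elided
    },
}
```

Evaluation then compares the model's multi-file trace and proposed fix against the ground-truth annotations using the metrics of Section 3.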

3. Evaluation Metrics and the LoCoBench Score

LoCoBench quantifies model capabilities using a 17-metric evaluation framework partitioned into four scoring dimensions:

  • Software Engineering Excellence (8 metrics): Includes Architectural Coherence Score (ACS), System Thinking Score, Dependency Traversal Accuracy (DTA), Cross-File Reasoning Depth (CFRD), Robustness, Comprehensiveness, Innovation, and Solution Elegance. ACS is defined by:

\text{ACS} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \frac{w(p)\cdot \alpha(p,C)}{\kappa(p)+\varepsilon}

where \mathcal{P} is the set of recognized architectural patterns, w(p) the pattern criticality, \alpha(p,C) its adherence, \kappa(p) its complexity, and \varepsilon a small constant (a computational sketch appears after this list).

  • Functional Correctness (4 metrics): Compilation success, unit/integration test pass rates, and Incremental Development Capability (IDC), which measures the model's ability to build on previous sessions.
  • Code Quality Assessment (3 metrics): Security Analysis Score, Average Issues Found (scored inversely, so fewer issues is better), and Code Style Adherence.
  • Long-Context Utilization (2 metrics): Information Coverage Utilization (ICU) quantifies effective context use, while Multi-Session Memory Retention (MMR) quantifies retention and utilization of information from prior development sessions.
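
The ACS formula translates directly into code. The sketch below is a minimal illustration; the pattern weights, adherence, and complexity values are invented, and real scoring would require the pattern-recognition step that produces them.

```python
def architectural_coherence_score(patterns, eps=1e-6):
    """ACS = (1/|P|) * sum over p of w(p) * alpha(p, C) / (kappa(p) + eps)."""
    return sum(w * alpha / (kappa + eps) for w, alpha, kappa in patterns) / len(patterns)

# Each tuple is (w(p), alpha(p, C), kappa(p)) for one recognized pattern p.
example_patterns = [
    (1.0, 0.9, 2.0),  # e.g. a critical layered-architecture pattern, well adhered to
    (0.5, 0.6, 1.0),  # e.g. a less critical repository pattern, partially adhered to
]
print(architectural_coherence_score(example_patterns))  # ≈ 0.375
```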

The aggregate metric, LoCoBench Score (LCBS), is given by:

\text{LCBS} = 5 \cdot (0.4 \cdot \text{SE} + 0.3 \cdot \text{FC} + 0.2 \cdot \text{CQ} + 0.1 \cdot \text{LCU})

where SE, FC, CQ, and LCU are the normalized dimension subscores, yielding an interpretable scalar in [0, 5].
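
The weighting is a direct transcription of the formula; the subscores in the example call below are arbitrary values, not reported results.

```python
def locobench_score(se, fc, cq, lcu):
    """LCBS = 5 * (0.4*SE + 0.3*FC + 0.2*CQ + 0.1*LCU), each subscore normalized to [0, 1]."""
    return 5 * (0.4 * se + 0.3 * fc + 0.2 * cq + 0.1 * lcu)

print(locobench_score(se=0.8, fc=0.7, cq=0.6, lcu=0.5))  # -> 3.5
```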

4. Model Performance Analysis and Empirical Findings

Benchmarked models (e.g., Gemini-2.5-Pro, GPT-5, Claude-Sonnet) show substantial variance in long-context capabilities:

  • Architectural Reasoning: GPT-5 shows stronger system-level understanding than its peers, as measured by ACS.
  • Cross-File Operations: Gemini-2.5-Pro leads on refactoring across files, with the smallest performance drop as context length approaches 1M tokens.
  • Specialization and Gaps: Certain models maintain unit-test accuracy or compilation success at all context lengths but drop sharply on integration-testing and multi-session tasks.
  • Context Utilization: Both ICU and MMR metrics reveal that information utilization falls substantially as scenario difficulty grows, particularly with increased context size and inter-session reasoning requirements.

Heatmaps and radar charts in the original work break results down by programming language and task category, showing that higher-level languages tend to yield better scores, while lower-level languages and expert-difficulty scenarios expose the "long-context capability gap."

5. Implications, Limitations, and Future Research Directions

LoCoBench uncovers several unmet challenges in current long-context LLM technology:

  • Performance Degradation: LCBS declines systematically as context size scales from 10K to 1M tokens, indicating that context-window expansion alone has not solved the underlying architectural and generalization bottlenecks.
  • Multi-Session Memory: MMR scores demonstrate that memory retention across development sessions remains a weak point, necessitating further research into persistent memory architectures and incremental context training.
  • Architectural Coherence and Cross-File Reasoning: Highly-specialized models may achieve functional correctness but perform poorly on ACS and cross-file metrics, suggesting that architectural understanding must be explicitly targeted in future model development and training regimens.
  • Evaluation Multidimensionality: The analysis supports that correctness alone is insufficient; evaluation must include architectural, security, and stylistic facets, motivating multidimensional benchmarks.

A plausible implication is that advancements in context management (e.g., hierarchical memory and retrieval capacities) or hybrid symbolic-neural modeling may be necessary to close the observed long-context capability gap.

6. Benchmark Availability and Role in the Software Engineering Community

LoCoBench is released by Salesforce AI Research at https://github.com/SalesforceAIResearch/LoCoBench and is accompanied by the full benchmark datasets, metric-calculation infrastructure, and synthetic codebases across 10 programming languages. By providing rigorously calibrated long-context scenarios and unified metrics, LoCoBench is intended to serve as a reference suite for developing, comparing, and tracking the progress of long-context LLMs aimed at complex software engineering workflows.

Its comprehensive methodology, scenario suite, and formally defined metrics set a high standard for future model evaluation, drive further research into true codebase-scale reasoning, and address previously neglected dimensions of model assessment for both industrial and academic communities.
