LoCoBench: Long-Context LLM Benchmark
- LoCoBench is a comprehensive benchmark designed to evaluate long-context capabilities in large language models across multi-file, complex software engineering scenarios.
- It systematically measures performance using 8,000 scenarios spanning 10 programming languages and 36 application domains with 17 specialized metrics, including the composite LoCoBench Score (LCBS).
- The benchmark highlights significant performance gaps in current LLMs under multi-million-token contexts, emphasizing the need for improved architectural reasoning and context coherence.
LoCoBench is a comprehensive large-scale benchmark for evaluating the long-context capabilities of LLMs in complex software engineering scenarios. Unlike previous code evaluation frameworks that focus on single-function completion or short-context tasks, LoCoBench systematically addresses the challenge of assessing LLM performance in settings that require reasoning across entire codebases, understanding multi-file architectures, and maintaining design consistency in realistic, large-scale software projects. LoCoBench provides 8,000 evaluation scenarios constructed across 10 programming languages and 36 application domains, with context windows varying from 10,000 to 1,000,000 tokens. The benchmark incorporates eight core task categories and a multi-dimensional metric framework with seventeen metrics, eight of which are proposed specifically for long-context analysis, aggregated into a single LoCoBench Score (LCBS). Empirical results demonstrate persistent performance gaps for state-of-the-art models on these challenging tasks and context sizes, highlighting unsolved problems and the need for further research in LLM-based software engineering (Qiu et al., 11 Sep 2025).
1. Motivation and Scope
LoCoBench was motivated by the increasing prevalence of long-context LLMs with multi-million-token context windows and their anticipated impact on advanced software engineering applications. Conventional code benchmarks are inadequate for evaluating long-range dependencies, architectural reasoning, and context continuity relevant to production-scale codebases. LoCoBench fills this gap by designing evaluation scenarios that span entire projects and explicitly require models to resolve cross-file relationships, maintain architectural consistency, and respond coherently across iterative development sessions. The scope of LoCoBench is defined by its coverage of ten programming languages (with equal representation) and its selection of 36 distinct application domains, yielding broad applicability to realistic software engineering workflows.
2. Benchmark Generation and Pipeline
The construction of LoCoBench follows a systematic five-phase pipeline:
- Project Specification Generation: Detailed requirements and design objectives are generated to guide the synthesis of realistic codebases.
- Codebase Synthesis: Complete, multi-file software repositories are programmatically generated, covering various architectures and feature sets.
- Scenario Derivation: The synthesized codebases are transformed into 8,000 distinct evaluation scenarios, each parameterized by context length (flexibly ranging from 10K to 1M tokens).
- Automated Quality Assurance: Each scenario is validated for structural soundness and completeness, mitigating common artifacts in synthetic benchmarks.
- LLM Evaluation and Metric Computation: Models are tested on these scenarios using an extensive set of metrics organized into four evaluation dimensions.
This pipeline ensures both diversity and control across language, architecture style, context length, and application type, enabling isolation of long-context degradation and task-specific weaknesses.
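As a concrete illustration of how the pipeline's controlled axes combine during scenario derivation, the following minimal Python sketch enumerates scenario specifications from small example slices of each axis. All names here (`ScenarioSpec`, `enumerate_scenarios`, the field names and sample values) are illustrative assumptions and do not reflect the released LoCoBench implementation.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class ScenarioSpec:
    """One evaluation scenario; field names are illustrative, not the released schema."""
    language: str
    domain: str
    task_category: str
    context_tokens: int  # target context size, from 10K up to 1M tokens

# Small illustrative slices of the benchmark's axes (the full benchmark covers
# 10 languages, 36 domains, 8 task categories, and context budgets up to 1M tokens).
LANGUAGES = ["python", "java", "go"]
DOMAINS = ["web_backend", "data_pipeline"]
TASKS = ["architectural_understanding", "cross_file_refactoring"]
CONTEXT_BUDGETS = [10_000, 100_000, 1_000_000]

def enumerate_scenarios() -> list[ScenarioSpec]:
    """Cross the controlled axes (Phase 3 of the pipeline) into scenario specifications."""
    return [
        ScenarioSpec(lang, dom, task, budget)
        for lang, dom, task, budget in product(LANGUAGES, DOMAINS, TASKS, CONTEXT_BUDGETS)
    ]

if __name__ == "__main__":
    print(len(enumerate_scenarios()), "scenario specifications")  # 3 * 2 * 2 * 3 = 36 here
```

In the actual pipeline, each specification would additionally be bound to a synthesized codebase (Phase 2) and validated (Phase 4) before evaluation.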
3. Task Categories
LoCoBench encompasses eight central software engineering tasks, each requiring sophisticated long-context reasoning:
| Task Category | Context Requirement | Core Challenge |
|---|---|---|
| Architectural Understanding | Multi-file, full-system | Design pattern inference, component linking |
| Cross-File Refactoring | Multi-file, code transformation | Dependency preservation, semantic transfer |
| Multi-Session Development | Sequential, persistent context | State continuity, incremental evolution |
| Bug Investigation | Project-wide, execution trace | Cross-module debugging, trace synthesis |
| Feature Implementation | Insertion into extensive codebases | Integration, conflict resolution |
| Code Comprehension | Large codebase analysis | Summarization, explanatory synthesis |
| Integration Testing | System-level, multi-component | Interface verification, robustness testing |
| Security Analysis | Project-wide, vulnerability detection | Threat modeling, policy adherence |
Each task was crafted to force long-range dependency tracking, semantic inference across disparate files, and coherent system-wide reasoning, exceeding the difficulty of traditional benchmarks.
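To make the multi-file nature of these tasks concrete, the hedged sketch below shows one way a Cross-File Refactoring scenario could be rendered as a single long prompt. The `build_refactoring_prompt` helper, the file layout, and the prompt wording are hypothetical illustrations, not the format used by the LoCoBench harness.

```python
# Hypothetical assembly of a Cross-File Refactoring scenario prompt. The real
# LoCoBench harness may format scenarios differently; this only illustrates how
# many files can be packed into one long context alongside an explicit task.

def build_refactoring_prompt(files: dict[str, str], instruction: str) -> str:
    parts = [f"### FILE: {path}\n{source}" for path, source in sorted(files.items())]
    parts.append(f"### TASK\n{instruction}")
    return "\n\n".join(parts)

example_files = {
    "app/models/user.py": "class User:\n    is_active: bool = True\n",
    "app/services/auth.py": (
        "from app.models.user import User\n\n"
        "def can_login(u: User) -> bool:\n"
        "    return u.is_active\n"
    ),
}
prompt = build_refactoring_prompt(
    example_files,
    "Rename User.is_active to User.active and update every dependent module.",
)
print(prompt)
```

At benchmark scale, the `files` dictionary would span an entire synthesized repository, pushing the assembled prompt toward the 10K to 1M token budgets described above.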
4. Evaluation Framework and Metrics
The evaluation methodology comprises seventeen metrics across four dimensions:
- Software Engineering Excellence (eight metrics): includes Architectural Coherence Score (ACS), Dependency Traversal Accuracy (DTA), and Cross-File Reasoning Depth (CFRD). ACS, for example, quantifies design consistency across a project's components, weighted by system-specific parameters.
- Functional Correctness (four metrics): Compilation Success (CCS), Unit Test Performance (UTP), Integration Test Performance (ITP), and Incremental Development Capability (IDC). IDC measures improvement over sequential multi-session workflows.
- Code Quality Assessment (three metrics): Security Adherence Score (SAS), Average Issues Found (AIF, inverted for scoring), and Code Style Adherence (CSA).
- Long-Context Utilization (two metrics): Information Coverage Utilization (ICU) evaluates the ratio of essential context information actually used, and Multi-Session Memory Retention (MMR) quantifies coherent cross-session context usage (a hedged sketch of an ICU-style computation follows this list).
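Assuming ICU is computed as the fraction of essential context elements that a model's output actually draws on, a minimal sketch looks as follows; the helper name and the set-based formulation are illustrative assumptions rather than the paper's exact definition.

```python
# Illustrative ICU-style computation, assuming ICU is the fraction of essential
# context elements (e.g., task-relevant files or symbols) that a model's answer
# actually references. The paper's exact definition may differ.

def information_coverage_utilization(essential: set[str], referenced: set[str]) -> float:
    if not essential:
        return 1.0  # nothing essential to cover
    return len(essential & referenced) / len(essential)

# Example: the response touches 3 of the 4 files that matter for the task.
icu = information_coverage_utilization(
    {"auth.py", "user.py", "db.py", "config.py"},
    {"auth.py", "user.py", "db.py", "routes.py"},
)
print(round(icu, 2))  # 0.75
```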
The composite LoCoBench Score (LCBS) is defined as a weighted sum of the four dimension scores,

$$\mathrm{LCBS} = w_{\mathrm{SE}}\, S_{\mathrm{SE}} + w_{\mathrm{FC}}\, S_{\mathrm{FC}} + w_{\mathrm{CQ}}\, S_{\mathrm{CQ}} + w_{\mathrm{LC}}\, S_{\mathrm{LC}},$$

where $S_{\mathrm{SE}}$, $S_{\mathrm{FC}}$, $S_{\mathrm{CQ}}$, and $S_{\mathrm{LC}}$ are the normalized scores for software engineering excellence, functional correctness, code quality, and long-context utilization, respectively. The weights were determined empirically according to the benchmarking objectives.
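The aggregation can be sketched in a few lines under the assumption of fixed normalized weights; the weight values, dictionary keys, and `locobench_score` helper below are placeholders for illustration, not the paper's empirically chosen weighting.

```python
# Minimal sketch of the LCBS aggregation as a weighted sum of the four normalized
# dimension scores. The weights below are placeholders chosen for illustration;
# the paper determined its weighting empirically, and the actual values may differ.

DIMENSION_WEIGHTS = {
    "software_engineering": 0.40,
    "functional_correctness": 0.30,
    "code_quality": 0.20,
    "long_context_utilization": 0.10,
}

def locobench_score(dimension_scores: dict[str, float]) -> float:
    """Aggregate normalized (0-1) dimension scores into a single LCBS value."""
    assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * dimension_scores[name] for name, w in DIMENSION_WEIGHTS.items())

example = {
    "software_engineering": 0.62,
    "functional_correctness": 0.71,
    "code_quality": 0.80,
    "long_context_utilization": 0.55,
}
print(round(locobench_score(example), 3))  # 0.676
```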
5. Model Evaluation and Observed Results
Evaluations of state-of-the-art LLMs on LoCoBench revealed consistent performance gaps, especially as context length increased from 10K to 1M tokens. In particular, Gemini-2.5-Pro showed strong results in cross-file refactoring and long-context utilization, whereas GPT-5 performed better in architectural reasoning tasks. Notably, all models experienced significant score degradation under expert-level scenarios (1M tokens), reflecting an inability to maintain context coherence and robust reasoning over extensive input lengths. Integration testing and architectural understanding were more tractable, but multi-session development and bug investigation imposed persistent challenges. This suggests existing LLMs are not yet equipped to manage the real-world demands of complex software engineering workflows at scale.
6. Implications and Future Research Directions
LoCoBench introduces a new standard for the evaluation of LLMs in software engineering, promoting rigorous multidimensional assessment and highlighting the need for focused improvement in long-context reasoning, architecture-aware modeling, and session continuity. The explicit measurement of context utilization and coherence underlies broader implications:
- Researchers are encouraged to develop novel model architectures and training objectives targeting large-scale multi-file dependency tracking and context memory.
- Model developers and practitioners can leverage LoCoBench to guide fine-tuning, domain adaptation, and selection procedures for production scenarios demanding complex codebase handling.
- A plausible implication is that advances in retrieval-augmented generation, sparse attention, and architectural inductive biases may be especially crucial for closing the current gaps documented by LoCoBench.
7. Benchmark Accessibility and Standardization
LoCoBench is publicly released at https://github.com/SalesforceAIResearch/LoCoBench, facilitating open, reproducible, and standardized comparisons across models. The benchmark design fosters unbiased head-to-head evaluation by ensuring robust scenario diversity, clearly specified metrics with mathematical notation, and normalization procedures for cross-task comparison. The inclusion of eight novel metrics tailored to long-context settings enhances interpretability and scientific rigor, enabling the benchmark to serve as a reference point for future research in LLM-based software development.
LoCoBench’s systematic approach and empirical findings provide a definitive foundation for the next generation of benchmarks and model architectures addressing the unsolved challenge of long-context understanding in complex software engineering workflows (Qiu et al., 11 Sep 2025).