LoCoBench-Agent Benchmark
- LoCoBench-Agent is an interactive benchmark framework designed to assess LLM agents in multi-turn, long-context software engineering tasks.
- It transforms 8,000 single-turn code-understanding scenarios into interactive development sessions, rigorously evaluating adaptive reasoning, tool usage, and error recovery.
- By simulating real-world environments with minimal, empty, and full initialization modes, it provides a comprehensive methodology for measuring agent performance under varied information constraints.
LoCoBench-Agent is an interactive benchmark framework specifically developed to assess LLM agents in realistic, long-context software engineering settings. Its design extends single-turn code understanding tasks into multi-turn, agentic workflows, enabling systematic evaluation of autonomous agents performing complex development scenarios across extensive codebases. With a standardized set of tools, bias-free metrics, and comprehensive scenario coverage, LoCoBench-Agent establishes a rigorous methodological basis for evaluating adaptive reasoning, tool usage, and long-term context management in LLM-based coding agents (Qiu et al., 17 Nov 2025).
1. Motivation and Design Principles
LoCoBench-Agent addresses the gap left by prior benchmarks such as LoCoBench, which focus exclusively on single-turn, long-context code understanding. Such benchmarks do not capture the multi-turn, interactive, and tool-driven workflows characteristic of real-world software engineering. LoCoBench-Agent converts 8,000 single-turn code-understanding scenarios into fully interactive development sessions, supporting context lengths from 10K to 1M tokens. The benchmark simulates agentic development through multi-phase task decomposition (exploration, planning, implementation, validation) and explicitly assesses error recovery, conversation efficiency, and architectural consistency.
Three initialization modes model real-world information availability:
- Minimal Mode: Limited to README, file tree, and entry points (used in 90% of evaluations).
- Empty Mode: Task specification and project root only.
- Full Mode: Complete codebase provided up front (used for small projects).
This progression enables the evaluation of agents under varying information constraints, emulating practical software engineering environments.
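The three modes can be read as three different starting contexts handed to the agent before its first turn. The following is a minimal sketch of that idea, not the benchmark's actual implementation: `InitMode`, `Project`, `initial_context`, and all field names are hypothetical, and it assumes that Minimal Mode also includes the task specification and exposes entry points as paths rather than file contents.

```python
from dataclasses import dataclass, field
from enum import Enum


class InitMode(Enum):
    """The three initialization modes described above (names are illustrative)."""
    EMPTY = "empty"
    MINIMAL = "minimal"
    FULL = "full"


@dataclass
class Project:
    # Hypothetical container for a benchmark project; not the benchmark's own type.
    root: str
    files: dict[str, str]                      # path -> file contents
    entry_points: list[str] = field(default_factory=list)


def initial_context(project: Project, task: str, mode: InitMode) -> dict:
    """Return what the agent can see before its first turn."""
    # Every mode starts from the task specification and the project root.
    context = {"task": task, "project_root": project.root}

    if mode is InitMode.EMPTY:
        # Empty Mode: nothing beyond the task and root; the agent must explore.
        return context

    if mode is InitMode.MINIMAL:
        # Minimal Mode: README, file tree, and entry points only.
        context["readme"] = project.files.get("README.md", "")
        context["file_tree"] = sorted(project.files)
        context["entry_points"] = list(project.entry_points)
        return context

    # Full Mode: the complete codebase is provided up front.
    context["files"] = dict(project.files)
    return context
```

Under these assumptions, an agent started in Empty or Minimal Mode has to retrieve file contents through tool calls during the session, whereas a Full Mode agent already holds them, which is why Full Mode is practical only for small projects.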
2. Specialized Tool Suite and Agent Interaction
LoCoBench-Agent equips agents with eight sandboxed, IDE-analogue tools accessed via a ReAct-style