Repository-Level Performance Optimization
- Repository-level performance optimization is a process of refining entire codebases by synthesizing patches to boost runtime efficiency while ensuring correctness.
- It employs benchmark-driven evaluation, dynamic profiling, and empirical tests to guide cross-file improvements such as algorithmic enhancements and data structure upgrades.
- Challenges include effective hot-path localization, scalability in large repositories, and integrating LLM-driven methods with long-horizon planning.
Repository-level performance optimization is the systematic process of improving runtime efficiency, resource utilization, or throughput for an entire software repository, while maintaining functional correctness across the codebase. Unlike local or file-level optimization, this domain involves understanding cross-file dependencies, complex execution paths, and system-level workloads. Recent research demonstrates both the complexity and urgency of this problem in light of the growing size of codebases and the emergence of LLM-driven software engineering agents. This article surveys the state of repository-level performance optimization, emphasizing benchmark-driven evaluation, LLM capabilities, algorithmic strategies, empirical results, and structural limitations in current approaches.
1. Problem Definition and Task Scope
Repository-level performance optimization is formally defined by providing agents with a complete code repository $\mathcal{R}$, a set of performance-related workloads $W$, and correctness test oracles $T$. The optimization objective is to synthesize a patch $\pi$ (or sequence of edits) such that, when applied to $\mathcal{R}$, the mean execution time of the workload is minimized (or meets a prescribed speedup threshold), while all designated correctness tests continue to pass. This is captured by the following constrained optimization:

$$\min_{\pi} \ \tau(\mathcal{R} \oplus \pi,\, W) \quad \text{s.t.} \quad \text{pass}(\mathcal{R} \oplus \pi,\, t) \ \ \forall\, t \in T,$$

where $\tau(\mathcal{R} \oplus \pi,\, W)$ denotes mean runtime on the performance workload $W$ after applying patch $\pi$ to $\mathcal{R}$. Both SWE-fficiency (Ma et al., 8 Nov 2025) and SWE-Perf (He et al., 16 Jul 2025) operationalize this definition by extracting challenging tasks from real-world repositories, focusing on "how-to-fix" (synthesis of non-trivial, multi-function patches) as opposed to mere localization or bug identification.
The optimization targets in these benchmarks are heterogeneous, including but not limited to algorithmic improvements (e.g., replacing $O(n^2)$ routines with $O(n \log n)$ alternatives), data structure replacement, vectorization, parallelization, and I/O batching. "Repository-level" denotes the freedom to edit any number of files and functions implicated in the workload trace, in contrast to the "oracle" or "file-level" setting where models may touch only a pre-identified region.
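The objective and constraint above map directly onto a measurement harness. The following is a minimal sketch, assuming a hypothetical repository layout with a `workload.py` performance script and a pytest correctness suite; it is not either benchmark's official harness:

```python
"""Minimal sketch of the pass/fail criterion: a patch is accepted only if the
correctness tests still pass and the workload speedup meets a threshold."""
import statistics
import subprocess
import time


def mean_workload_runtime(repo_dir: str, runs: int = 5) -> float:
    """Mean wall-clock time of the performance workload over repeated runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["python", "workload.py"], cwd=repo_dir, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def tests_pass(repo_dir: str) -> bool:
    """The designated correctness tests must still pass after patching."""
    return subprocess.run(["pytest", "-q", "tests/"], cwd=repo_dir).returncode == 0


def evaluate_patch(pre_dir: str, post_dir: str, min_speedup: float = 1.0) -> bool:
    """Correctness is a hard constraint; runtime improvement is the objective."""
    if not tests_pass(post_dir):
        return False
    speedup = mean_workload_runtime(pre_dir) / mean_workload_runtime(post_dir)
    return speedup >= min_speedup
```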
2. Benchmark Construction and Evaluation Protocols
Modern benchmarks for repository-level performance optimization are built from real merged pull requests (PRs) that provably improve workload runtimes while maintaining correctness. The construction protocol, as instantiated in SWE-fficiency (Ma et al., 8 Nov 2025) and SWE-Perf (He et al., 16 Jul 2025), adheres to the following pipeline:
- PR Mining and Attribution: Filter all merged PRs in the repository history by keyword heuristics (e.g., "perf", "speedup"), apply AST differencing to exclude non-semantic or doc-only changes, and discard test-only modifications.
- Workload Extraction: For each candidate PR, extract or author a reproducible workload script (e.g., using `timeit`) that exercises the targeted performance path.
- Coverage and Regression Analysis: Run unit tests with coverage tooling to associate PR-edited lines with relevant tests. Only PRs where at least one test covers the diff are retained.
- Performance Verification: Run the workload pre- and post-patch in a pinned, controlled environment (e.g., 1 CPU core, 16 GB RAM, Docker container). Retain only those PRs where the measured speedup exceeds a minimum threshold (as prescribed, e.g., in SWE-Perf).
- Statistical Validation: Use repeated runs and nonparametric tests (e.g., the Mann–Whitney U test) to ensure speedup claims are robust, and apply strong outlier rejection to avoid performance flukes (see the sketch below).
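A sketch of the statistical-validation step under assumed settings (IQR-based outlier rejection and a one-sided Mann–Whitney U test at α = 0.05); the benchmarks' exact thresholds may differ:

```python
"""Robustness check for a claimed speedup: compare repeated pre- and
post-patch timings after rejecting outliers."""
import numpy as np
from scipy.stats import mannwhitneyu


def reject_outliers(samples: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Drop timings outside k * IQR to avoid performance flukes."""
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    mask = (samples >= q1 - k * iqr) & (samples <= q3 + k * iqr)
    return samples[mask]


def speedup_is_robust(pre: np.ndarray, post: np.ndarray, alpha: float = 0.05) -> bool:
    """Accept the PR only if post-patch runtimes are significantly lower."""
    pre, post = reject_outliers(pre), reject_outliers(post)
    _, p_value = mannwhitneyu(post, pre, alternative="less")
    return p_value < alpha
```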
Each benchmark instance includes: (i) the full pre- and post-patch codebases, (ii) identified workloads, (iii) a set of unit tests guarding correctness, (iv) an expert-authored or "gold" performance patch, and (v) runtime metadata.
Metrics used in these benchmarks are multi-tiered:
- Application success: Whether a candidate patch is cleanly applicable.
- Correctness: Whether all required tests pass after patching.
- Performance: Ratio or percentage improvement, typically measured as a harmonic mean of per-instance speedups of the form

$$\text{Score} = \left( \frac{1}{N} \sum_{i=1}^{N} s_i^{-1} \right)^{-1}, \qquad s_i = \frac{\tau_i^{\text{pre}} / \tau_i^{\text{model}}}{\tau_i^{\text{pre}} / \tau_i^{\text{gold}}},$$

where "gold" refers to the expert patch and $\tau_i$ denotes the measured mean runtime for instance $i$.
These protocols set a high bar for reproducibility and rigor, with datasets spanning scientific computing, ML, and data analysis projects.
3. LLM-based and Agent-based Optimization Strategies
Recent work has explored both agentless and agent-based LLM methods for repository-level optimization.
- File-level (Oracle) Setting: Here, candidate models receive the contents of precisely those files and functions modified by the human expert. Prompts include function signatures and explicit instructions to preserve correctness while improving runtime. Models such as Claude-3.7-Sonnet, GPT-4o, and Gemini-2.5-Pro are evaluated in single-pass chain-of-thought settings (He et al., 16 Jul 2025).
- Repo-level, Agentless Approach (Xia et al.; evaluated in He et al., 16 Jul 2025):
- Localization: Use dynamic (profiling) or static (call-graph) analysis to rank suspicious or hot files and functions (see the profiling sketch below).
- Generation: Apply LLMs to synthesize optimization candidates on the ranked regions.
- Test-driven Ranking: For each candidate, apply the patch, rebuild, rerun performance workloads and tests, and select the most effective patch that passes all correctness oracles.
- Repo-level, Agent-based Approach (OpenHands, Wang et al.; evaluated in Ma et al., 8 Nov 2025):
- Multi-agent simulations (developer + tester) iterate for up to 50 cycles, reasoning over the file tree, issuing edits, rerunning tests, and keeping state (prior patches). Claude-3.7-Sonnet is a typical LLM backend. Agents search across files, use retrieval to fetch context, and maintain a sandboxed build/test loop (see the loop skeleton below).
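To make the agentless localization step concrete, here is a minimal sketch that profiles a stand-in workload with cProfile and ranks functions by cumulative time, so that only the hottest regions would be passed to the LLM for patch generation; `run_workload` is a hypothetical placeholder for a benchmark workload script:

```python
"""Dynamic localization: rank functions by cumulative time under the workload."""
import cProfile
import pstats


def run_workload() -> None:
    # Placeholder workload; in the benchmarks this exercises the repo's hot path.
    data = [i % 1000 for i in range(200_000)]
    total = 0
    for x in data:
        total += sum(range(x % 50))


def rank_hot_functions(top_n: int = 10) -> list[tuple[str, float]]:
    """Return (file:line:function, cumulative seconds) pairs, hottest first."""
    profiler = cProfile.Profile()
    profiler.runcall(run_workload)
    stats = pstats.Stats(profiler).stats  # {(file, line, name): (cc, nc, tt, ct, callers)}
    ranked = sorted(
        ((f"{path}:{line}:{name}", entry[3]) for (path, line, name), entry in stats.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_n]


for location, seconds in rank_hot_functions():
    print(f"{seconds:8.4f}s  {location}")
```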
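And a control-flow skeleton of the iterative developer/tester loop, with trivial stubs standing in for the LLM calls and the sandboxed build/test harness; this illustrates the iteration pattern only and is not the OpenHands implementation:

```python
"""Skeleton of a bounded developer/tester optimization loop over a repository."""
import random

MAX_CYCLES = 50


def propose_patch(repo, history):   # stub for the developer agent's LLM call
    return {"id": len(history), "edit": "hypothetical diff"}

def apply_patch(repo, patch):       # stub for applying the diff in a sandbox
    return dict(repo, last_patch=patch["id"])

def run_tests(repo):                # stub for the tester agent's correctness check
    return random.random() > 0.3

def measure_runtime(repo):          # stub for re-running the performance workload
    return random.uniform(0.5, 1.0)


def optimize_repository(repo):
    best_patch, best_runtime = None, measure_runtime(repo)
    history = []                    # agents keep state (prior patches) across cycles
    for _ in range(MAX_CYCLES):
        patch = propose_patch(repo, history)
        candidate = apply_patch(repo, patch)
        if not run_tests(candidate):        # correctness oracle gates every edit
            history.append((patch["id"], "tests_failed"))
            continue
        runtime = measure_runtime(candidate)
        history.append((patch["id"], runtime))
        if runtime < best_runtime:
            best_patch, best_runtime = patch, runtime
    return best_patch, best_runtime


print(optimize_repository({"name": "toy_repo"}))
```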
Empirical results indicate a substantial performance gap between LLM-driven agents and expert solutions. For example, the best agent-based methods (OpenHands) achieve only 2.26% average performance gain on SWE-Perf, in contrast to the expert's 10.85% (He et al., 16 Jul 2025); similarly, top agents achieve only 0.007–0.15× of expert speedup on SWE-fficiency (Ma et al., 8 Nov 2025).
4. Representative Optimization Techniques and Task Taxonomy
Analysis of human-authored and LLM-generated patches reveals a taxonomy of repository-level performance improvements (He et al., 16 Jul 2025):
- Algorithmic replacements: $O(n^2) \to O(n \log n)$ rewrites, divide-and-conquer routines.
- Data-structure upgrades: List → set/dict replacements (to exploit constant-time membership), memoization (see the before/after sketch following this list).
- Vectorization/bulk operations: Hand-written loops replaced with high-throughput library calls (e.g., NumPy, pandas).
- Parallelization: Employ multiprocessing or thread pools to exploit independent subtasks.
- I/O optimizations: Batching, buffering, or streamlining disk/network operations.
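Two of the most common patterns above, shown as toy before/after pairs (a list-to-set membership upgrade and a NumPy vectorization); the data and code are illustrative and not drawn from the benchmark repositories:

```python
"""Before/after examples of a data-structure upgrade and a vectorization."""
import numpy as np

blocked_ids = list(range(50_000))
queries = list(range(0, 100_000, 7))

# Before: O(n) membership test inside the loop -> O(n * m) overall.
hits_slow = [q for q in queries if q in blocked_ids]

# After: data-structure upgrade; set membership is O(1) on average.
blocked_set = set(blocked_ids)
hits_fast = [q for q in queries if q in blocked_set]
assert hits_slow == hits_fast

# Before: hand-written accumulation loop.
values = np.random.rand(1_000_000)
total_slow = 0.0
for v in values:
    total_slow += v * v

# After: vectorized bulk operation via a high-throughput library call.
total_fast = float(np.dot(values, values))
assert np.isclose(total_slow, total_fast)
```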
Typical LLM patches favor "local" fixes such as trivial memoization or syntax-level tweaks; deep refactorings (cross-module, vectorized, or parallel solutions) are exceedingly rare. Human experts make larger-scale changes, especially on longer-running workloads or tasks with more complex bottlenecks.
The most common agent errors include failure to edit the "hot" path, overfitting to a single test, and attempting superficial optimizations that yield only slight improvement.
5. Capability Gaps, Failure Modes, and Scaling Trends
Evaluation of LLM-based repository-level optimization agents reveals multiple persistent difficulties:
- Localization Failure: Agents often fail to identify the core bottleneck, particularly when the hot function is several levels deep in the call graph. In SWE-fficiency, over 68% of expert speedup mass comes from functions never edited by the LLM agent (Ma et al., 8 Nov 2025).
- Superficiality/Satisficing: "Low-hanging fruit" optimizations (early-exit checks, micro-caches) predominate, with few cases of deeper algorithmic restructuring or high-level domain-specific abstraction.
- Long-horizon Planning: LLMs rarely execute full cycles of profile–patch–rebuild–reprofile–test, and often halt after the first small observed speedup.
- Correctness Maintenance: Test selection and evaluation can drift, e.g., by running too many or too few tests, or failing to rebuild C/Cython extensions.
- Generalization and Overfitting: In some cases, agents inject hardcoded shapes/parameters that pass a single test trace but do not generalize.
- Scalability: As the number of functions or code size increases, both agent effectiveness and performance gains drop sharply.
Empirically, unit test regression rates for leading LLM agents can reach 18–45% (Ma et al., 8 Nov 2025). Mean speedup ratios for agent patches decrease as tasks demand larger edits, longer pre-patch runtimes, or higher gold speedups.
6. Research Directions and Recommendations
Current agent gaps suggest several promising avenues for future research:
- Integration of Static and Dynamic Analysis: Incorporating call-graph extraction, type propagation, and dynamic profiling can better inform the region of interest for patch synthesis and help localize optimization opportunities (Ma et al., 8 Nov 2025, He et al., 16 Jul 2025); see the sketch after this list.
- Hierarchical Reasoning Architectures: Decoupling bottleneck localization from patch generation—e.g., first finding hot-paths, then synthesizing optimizations—may mitigate satisficing and enable deeper improvements.
- Test-Guided Synthesis: Co-optimization for correctness and performance, possibly with in-harness feedback loops and specialized profiling primitives.
- Prompt Engineering: Few-shot human-expert exemplars categorized by optimization class, chain-of-thought reasoning templates that explicitly trace computational complexity, and retrieval-augmented chaining to assist cross-file modifications.
- Human-in-the-Loop Hybrid Systems: Enabling human selection or ranking among LLM proposals, and "expert advisor" models trained on small sets of gold patches per repository.
- Scaling Benchmarks: Extending repository-level optimization tasks to GPU-bound, I/O-bound, and multi-language scenarios, and embedding continuous performance regression checks in CI workflows.
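As a sketch of the static half of the first direction above, the snippet below extracts a call graph with Python's `ast` module and computes which functions are reachable from a workload entry point; intersecting this set with dynamic profiling output (as in the profiling sketch in Section 3) would yield the candidate region for patch synthesis. The module source and entry-point name are hypothetical:

```python
"""Static call-graph extraction plus reachability from a workload entry point."""
import ast
from collections import defaultdict

SOURCE = """
def inner(n):
    return sum(i * i for i in range(n))

def outer(n):
    return [inner(n) for _ in range(50)]

def unrelated():
    return 42
"""


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the function names it calls."""
    graph = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph[node.name].add(call.func.id)
    return dict(graph)


def reachable_from(graph: dict[str, set[str]], entry: str) -> set[str]:
    """Functions transitively reachable from the workload's entry point."""
    seen, stack = set(), [entry]
    while stack:
        fn = stack.pop()
        if fn in seen or fn not in graph:
            continue
        seen.add(fn)
        stack.extend(graph[fn])
    return seen


graph = call_graph(SOURCE)
print("call graph:", graph)
print("candidate region for 'outer' workload:", reachable_from(graph, "outer"))
```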
7. Broader Implications and Connections
Repository-level performance optimization represents a significant, unsolved challenge for automated software engineering. As evidenced by large-scale benchmarks, current LLMs underperform experienced human engineers by an order of magnitude in mean workload speedup on realistic tasks (0.007–0.15× of expert speedup in SWE-fficiency; 2.26% vs. 10.85% mean improvement in SWE-Perf) (Ma et al., 8 Nov 2025, He et al., 16 Jul 2025). Key obstacles include cross-file reasoning, long-horizon planning, and the synthesis of complex, system-level refactorings.
A plausible implication is that progress on repository-level optimization will require hybrid architectures that integrate powerful retrieval mechanisms, analytic tools for code analysis and localization, and improved mechanisms for test-driven program synthesis. Advances in this domain have the potential to enable LLMs and code agents to participate meaningfully in production-scale software performance engineering, with implications for the broader adoption of AI in software maintenance and evolution.