
OpenHands-Perf-Agent: Optimizing Performance Bugs

Updated 1 October 2025
  • OpenHands-Perf-Agent is a performance-focused extension of the OpenHands framework, addressing non-functional runtime and resource-efficiency bugs in .NET repositories.
  • It incorporates explicit benchmarking instructions, dynamic microbenchmark generation using BenchmarkDotNet, and efficient output processing to extract key performance metrics.
  • The agent shows improved success rates on the PerfBench benchmark by rearchitecting algorithms for significant gains in runtime efficiency and memory usage.

OpenHands-Perf-Agent is a performance-focused extension of the OpenHands agent framework for automated bug resolution and code optimization, engineered to target non-functional runtime and resource-efficiency bugs in software repositories. Developed in response to the critical shortcomings of existing agents on performance bug-fixing tasks, OpenHands-Perf-Agent incorporates explicit benchmarking, performance-aware instructions, and robust output-processing tools to systematically address inefficiencies, as evaluated on the PerfBench benchmark for .NET repositories (Garg et al., 28 Sep 2025).

1. Benchmark Context and Task Formulation

PerfBench serves as the primary evaluation suite for OpenHands-Perf-Agent, comprising 81 real-world performance bug-fixing tasks extracted from popular GitHub .NET repositories. Each task is anchored in an actual developer-reported performance issue with validated gold patches, accompanied by repository metadata, pre- and post-fix commit hashes, and rigorous manual verification of real performance improvements. The problem formulation strictly prohibits agents from accessing git history, requiring all reasoning and patching to proceed from an isolated “buggy” snapshot of the codebase (Garg et al., 28 Sep 2025).

Distinct from prior functional correctness benchmarks (e.g., SWE-bench), PerfBench mandates evidence of true non-functional improvement. Agents are instructed to not only pass all unit tests, but also generate their own BenchmarkDotNet benchmarks and deliver demonstrable improvements in runtime efficiency, memory allocation, or garbage collection activity.

2. Agent Architecture and Key Enhancements

OpenHands-Perf-Agent derives from the general OpenHands agent platform, which is characterized by modular, step-wise interaction with code, terminal, and browser environments in an isolated sandbox (Docker container) (Wang et al., 23 Jul 2024). Its unique enhancements for performance bug fixing are:

  • Performance-Aware Instruction Set: The agent is prompted with explicit guidelines for diagnosing, benchmarking, and reasoning about performance-impairing code. It receives examples for crafting appropriate BenchmarkDotNet benchmarks and interpreting their outputs.
  • Benchmark Generation Module: Rather than relying on pre-existing tests, the agent learns to generate precise microbenchmarks that cover the code region implicated by the bug report. This module enables direct, quantitative assessment of agent-proposed patches.
  • Output Processing Tooling: Given the often excessive verbosity of BenchmarkDotNet output, custom parsers are introduced that extract only the salient summary tables (mean, median, and standard deviation of timings, GC generations, allocations). This ensures performance metrics fit into the LLM’s context window, reducing token usage by over 90% while preserving decision-critical data.
  • Algorithmic Planning: The agent is guided to reason about common .NET performance anti-patterns (e.g., box allocations, concurrency mishandling, redundant serialization) and exploit classical optimizations (O(n) to O(1) lookup, efficient I/O handling). This is exemplified in a task where CollectionTally.cs is rewritten from O(n) linear search to a hash-based lookup, reducing memory allocations by 84% and CPU time by 70%.
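
The output-processing idea can be sketched as follows. This is an illustrative Python sketch (the paper does not publish the agent's actual parser, and the real tooling operates on full BenchmarkDotNet console output): it keeps only the markdown-style summary table rows from verbose benchmark output and returns them as structured records.

```python
def extract_summary(output: str) -> list[dict[str, str]]:
    """Keep only the markdown-style summary table from verbose
    benchmark console output (illustrative sketch, not the agent's
    actual parser)."""
    # BenchmarkDotNet prints its summary as a pipe-delimited table;
    # everything else (logs, warmup iterations, diagnostics) is noise.
    rows = [line for line in output.splitlines() if line.lstrip().startswith("|")]
    if len(rows) < 3:          # need header, separator, and one data row
        return []
    header = [c.strip() for c in rows[0].strip("|").split("|")]
    table = []
    for line in rows[2:]:      # rows[1] is the |---:|---:| separator
        cells = [c.strip() for c in line.strip("|").split("|")]
        table.append(dict(zip(header, cells)))
    return table
```

Returning only the parsed table rows (rather than the raw console dump) is what keeps the metrics within the model's context budget.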

3. Evaluation Methodology and Quantitative Metrics

PerfBench employs an automated harness for reproducible agent evaluation:

  • Each task is instantiated in a new Docker environment matching the original runtime (SDK version, OS, dependencies).
  • The agent is provided with the buggy code and the issue description. Using its performance-aware architecture, it proposes both a patch and a custom benchmark.
  • After the patch is applied, the harness runs unit tests (functional check) and benchmarks before and after the fix.
  • The patch is deemed successful if at least one quantitative metric (execution time, memory allocation, GC) exhibits improvement and no other metric regresses.
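
The acceptance rule above can be expressed as a small predicate. This is a hypothetical sketch of the harness logic (the exact implementation is not published), assuming all tracked metrics are lower-is-better, e.g. mean time, bytes allocated, GC collection count:

```python
def patch_succeeds(before: dict[str, float],
                   after: dict[str, float],
                   tests_pass: bool) -> bool:
    """PerfBench-style acceptance sketch: unit tests must pass,
    at least one metric must improve, and no metric may regress.
    Assumes every metric is lower-is-better."""
    if not tests_pass:
        return False
    improved = any(after[m] < before[m] for m in before)
    regressed = any(after[m] > before[m] for m in before)
    return improved and not regressed
```

The "no other metric regresses" clause is what prevents the agent from, say, trading a large memory increase for a small speedup.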

Success rate is defined as:

$$\text{Success Rate (\%)} = \frac{\text{Number of Successful Fixes}}{81} \times 100$$

Additional metrics include input/output token usage, step count, and dollar cost per instance, reflecting both computational efficiency and economic viability.
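
As a one-line illustration of the formula over PerfBench's 81 tasks (the fix counts here are examples, not figures from the paper):

```python
def success_rate(successful_fixes: int, total_tasks: int = 81) -> float:
    """Success rate in percent over PerfBench's 81 tasks."""
    return successful_fixes / total_tasks * 100
```

For example, 12 successful fixes out of 81 corresponds to roughly 14.8%.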

4. Performance Results and Comparative Analysis

The baseline OpenHands agent achieved a success rate of approximately 3–4%, confirming the difficulty of performance bug-fixing relative to functional patching (where success rates on tasks like SWE-bench Verified may exceed 60%) (Garg et al., 28 Sep 2025). With its specialized enhancements, OpenHands-Perf-Agent substantially improves performance, achieving up to 20% success rate (14.8% for GPT-4.1, 19.7% for Claude Sonnet 4). The highest category-specific results occur for algorithmic I/O serialization bugs (up to 33%).

Key factors driving improvement include:

  • Explicit benchmarking instructions that facilitate diagnostic reasoning and verification.
  • Adaptive use of output-processing tools that minimize LLM context overflow.
  • Iterative refinement that enables more effective patch suggestions.

Despite these gains, success rates remain modest compared to functional bug-fixing, underscoring the intrinsic complexity of non-functional, performance optimization tasks.

5. Technical Innovations and Methodological Impact

OpenHands-Perf-Agent introduces methodological advances in several key areas:

  • Benchmark autonomy: By dynamically generating benchmarks, it ensures fixes are not tailored solely to pre-existing, possibly misleading or incomplete test coverage.
  • Quantitative output curation: Token-constrained extraction ensures only the relevant performance metrics inform patch iteration.
  • Algorithmic planning: The agent now demonstrates capability to restructure algorithms for impactful, measurable efficiency gains, as evidenced in tasks transitioning from linear to constant-time operations.
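
The kind of linear-to-constant-time rewrite described above can be sketched as follows. This is an illustrative Python example, not the actual CollectionTally.cs patch (which is C#): both functions compute the expected items left unmatched by the actual items, but the second replaces per-element linear scans with a hash-based count.

```python
from collections import Counter

def tally_linear(expected: list, actual: list) -> list:
    """O(n*m): for each actual item, linearly search the remaining
    expected items and remove one match."""
    remaining = list(expected)
    for item in actual:
        if item in remaining:          # O(n) membership scan
            remaining.remove(item)     # another O(n) scan
    return remaining

def tally_hashed(expected: list, actual: list) -> list:
    """O(n + m): count expected items once, then decrement a hash
    table entry per actual item."""
    counts = Counter(expected)
    for item in actual:
        if counts.get(item, 0) > 0:
            counts[item] -= 1
    return [item for item, k in counts.items() for _ in range(k)]
```

Both versions return the same multiset of leftover items; the hashed version avoids the quadratic scan cost, which is the source of the allocation and CPU-time savings cited for the CollectionTally task.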

These features differentiate OpenHands-Perf-Agent from other coding agents evaluated on the same benchmark, both in architecture and empirical effectiveness.

6. Limitations and Prospects for Future Work

The observed limitations of OpenHands-Perf-Agent center on:

  • Persistent challenges in concurrency, threading, and build-related performance bugs, where semantic diagnosis and architectural change are more involved than in algorithmic or I/O hot spots.
  • Semantic alignment with developer intent, as the agent can improve micro-benchmarks while failing to address higher-level system constraints or user-facing latency.
  • Multi-metric trade-offs, since agents often optimize a single dimension at the expense of others or misinterpret performance requirements when non-CPU or non-memory bottlenecks dominate.

Suggested research directions include:

  • Integration of richer evaluation frameworks that combine quantitative metrics with qualitative alignment checks (e.g., LLM-as-a-judge for patch comparison).
  • Extension to broader languages and frameworks, beyond .NET, facilitating generalization of results.
  • Reinforcement learning or iterative feedback loops allowing the agent to refine fixes based on repeated benchmarking cycles.
  • Enhanced semantic understanding and multi-metric optimization to better capture the nuances of real-world performance engineering.

7. Significance for Agentic Software Engineering

OpenHands-Perf-Agent exemplifies a transition from purely functional to performance-aware agentic software engineering. By explicitly embedding benchmark generation, metric parsing, and algorithmic planning into the agent loop, it addresses a long-standing gap in the automation of non-functional bug resolution. The results demonstrate meaningful progress yet document the substantial headroom for methodological and technical innovation in automated performance optimization for production code (Garg et al., 28 Sep 2025). A plausible implication is that future agent designs should interleave functional correctness and performance benchmarking in a unified reasoning cycle, leveraging both prompt engineering and output processing to systematically close the capability gap.
