- The paper demonstrates that robust, task-specific LLM agent scaffolding achieves a 94% verified success rate across 35 diverse software analysis tasks.
- It systematically compares four agent architectures, showing that structured decomposition and evidence-based validation outperform unstructured LLM outputs.
- It highlights significant efficiency gains, with lower costs and reduced execution times, emphasizing the role of adaptive agent designs in complex environments.
Evaluating LLM Agents on Automated Software Analysis Tasks
Problem Context and Task Formulation
Automated software analysis—encompassing static/dynamic analyzers, symbolic execution, fuzzers, and profilers—plays a crucial role in contemporary software lifecycles. However, practical deployment of these tools at scale is hindered by burdensome environment setup, dependency wrangling, and nontrivial configuration for both tools and diverse target codebases. The work introduces a rigorous benchmark, Analysis, comprising 35 tool–project pairs spanning seven nontrivial C/C++ and Java analyzers and ten open-source projects. Each task demands that an agent autonomously provision a reproducible container, install and configure the chosen tool, build and prepare the target project, run the analysis, and produce definitive, verifiable evidence of meaningful tool output.
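The paper does not publish its task schema, but a minimal sketch helps make the formulation concrete. All field names and values below are hypothetical illustrations, not the benchmark's actual data format:

```python
from dataclasses import dataclass

@dataclass
class AnalysisTask:
    """One tool-project pair from the benchmark (field names hypothetical)."""
    tool: str            # e.g., "infer" or "wala"
    project_url: str     # open-source target to clone and build
    language: str        # "c/c++" or "java"
    evidence_check: str  # what counts as meaningful tool output

# A hypothetical task in the spirit of the 35 tool-project pairs:
task = AnalysisTask(
    tool="infer",
    project_url="https://github.com/example/project",  # placeholder URL
    language="c/c++",
    evidence_check="non-empty report.json with at least one issue record",
)
```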
Agent Architectures and Evaluation Design
Four agent architectures are systematically evaluated:
- RAG-Agent: Synthesis-driven, retrieval-augmented via web resources, generates aggregate build and analysis scripts upfront.
- Mini-SWE-Agent: Iterative ReAct-style agent cycling reasoning and shell/file actions for incremental debugging.
- ExecutionAgent: Emphasizes robust environment construction and deferred, in-container execution.
- Analysis (proposed): Task-specific, integrating explicit staged decomposition (Docker setup, tool, project, analysis), single-action cycles with deterministic log condensation, and evidence-based, LLM-as-judge-validated success checks; a minimal sketch of this staged loop follows the list.
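The sketch below illustrates what a staged, single-action loop with deterministic log condensation could look like. The stage names follow the decomposition above, but `llm.next_action`, `llm.stage_done`, and the overall structure are assumptions for illustration, not the paper's implementation:

```python
import subprocess

STAGES = ["docker_setup", "tool_install", "project_build", "run_analysis"]

def run_agent(task, llm, max_cycles=120):
    """Single-action agent loop over explicit stages (structure assumed)."""
    history = []
    for stage in STAGES:
        for _ in range(max_cycles // len(STAGES)):
            # Ask the LLM for exactly one shell action toward the stage goal.
            action = llm.next_action(stage=stage, history=history)
            result = subprocess.run(action, shell=True, capture_output=True,
                                    text=True, timeout=60)  # illustrative cap
            # Deterministic log condensation: keep only head/tail of output.
            log = condense(result.stdout + result.stderr)
            history.append((action, log))
            if llm.stage_done(stage, history):
                break
    return history

def condense(log, head=20, tail=20):
    """Keep the first/last lines of long logs to fit the context window."""
    lines = log.splitlines()
    if len(lines) <= head + tail:
        return log
    return "\n".join(lines[:head] + ["... [condensed] ..."] + lines[-tail:])
```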
All agents are tested under identical per-task budgets (at most 120 cycles, 5 minutes of wall-clock time, and $2 of API spend) with four LLM backends spanning a range of context lengths and price points (GPT-5-nano, GPT-5-mini, DeepSeek-V3.2, Gemini-3-Flash), over the full 35-task benchmark.
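A per-task budget like this can be enforced with a small guard object. The sketch below mirrors the stated limits; the class and method names are hypothetical:

```python
import time

class BudgetExceeded(Exception):
    pass

class Budget:
    """Per-task limits mirroring the stated setup: 120 cycles, a wall-clock
    cap, and a $2 API budget (names and structure are illustrative)."""
    def __init__(self, max_cycles=120, max_seconds=5 * 60, max_cost_usd=2.0):
        self.start = time.monotonic()
        self.cycles, self.cost = 0, 0.0
        self.max_cycles, self.max_seconds, self.max_cost = (
            max_cycles, max_seconds, max_cost_usd)

    def charge(self, api_cost_usd):
        """Call once per agent cycle with that cycle's API cost."""
        self.cycles += 1
        self.cost += api_cost_usd
        if (self.cycles > self.max_cycles
                or time.monotonic() - self.start > self.max_seconds
                or self.cost > self.max_cost):
            raise BudgetExceeded(f"cycle={self.cycles}, cost=${self.cost:.2f}")
```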
Quantitative Findings
The Analysis agent, paired with Gemini-3-Flash or DeepSeek-V3.2, achieves a manually verified success rate of 94% (33/35), outperforming the best baseline (ExecutionAgent with Gemini-3-Flash, 77%) by a substantial margin.
Figure 2: Analysis is consistently more likely to succeed than all baselines, with odds ratios substantially exceeding 1 across all evaluated LLMs and in aggregate.
Effect-size analysis underscores this advantage: odds ratios of 34.5, 8.1, and 2.7 over RAG-Agent, Mini-SWE-Agent, and ExecutionAgent respectively (all $p_{\text{adj}} < 0.001$, Holm–Bonferroni), with results robust across LLM choice. Notably, model choice alone does not suffice for high end-to-end reliability; agentic scaffolding exerts greater influence on verified outcomes than raw LLM capability.
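For readers who want to reproduce this style of effect-size analysis on their own data, the sketch below computes odds ratios from 2×2 success/failure tables and applies a Holm–Bonferroni correction. The counts shown are illustrative placeholders, not the paper's raw data:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

def odds_ratio(a_succ, a_fail, b_succ, b_fail):
    """Odds ratio of agent A succeeding vs. agent B, from a 2x2 table."""
    or_, p = fisher_exact([[a_succ, a_fail], [b_succ, b_fail]])
    return or_, p

# Hypothetical per-agent counts out of 35 tasks (NOT the paper's raw data):
comparisons = {
    "vs RAG-Agent":      odds_ratio(33, 2, 10, 25),
    "vs Mini-SWE-Agent": odds_ratio(33, 2, 18, 17),
    "vs ExecutionAgent": odds_ratio(33, 2, 27, 8),
}
pvals = [p for _, p in comparisons.values()]
# Holm-Bonferroni adjustment across the three comparisons.
reject, p_adj, *_ = multipletests(pvals, method="holm")
```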
Crucially, self-validated success rates are systematically inflated across all baselines. For instance, Mini-SWE-Agent and RAG-Agent often self-report >90% success, yet manual inspection confirms only 9–37%. The Analysis agent's validator reduces, but does not eliminate, this gap, underscoring the necessity of extrinsic, evidence-based validation for robust assessment in complex software analysis workflows.
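As an illustration of what an extrinsic, evidence-based check can look like, the following sketch validates an Infer run by inspecting its report on disk rather than trusting the agent's own claim. This is a simplified heuristic, not the paper's validator (which additionally uses an LLM-as-judge pass):

```python
import json
from pathlib import Path

def validate_infer_run(report_path="infer-out/report.json"):
    """Evidence check for an Infer run: success is claimed only if the
    report exists, parses, and contains at least one issue record."""
    path = Path(report_path)
    if not path.is_file():
        return False, "no report produced"
    try:
        issues = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False, "report is not valid JSON"
    if not isinstance(issues, list) or not issues:
        return False, "report is empty"
    return True, f"{len(issues)} issues reported"
```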
Success and failure rates vary strongly with the analysis tool and target ecosystem. Java tasks and whole-program/symbolic analyzers exhibit higher failure rates and resource consumption, with WALA and Infer dominating the residual failures.
Figure 1: Success rates by tool and agent—Analysis closely approaches 100% for most static analyzers; hardest targets are whole-program Java analyses.
Root-cause analysis of failed trajectories reveals distinct failure clusters, such as repeated compile-fix loops and validation failures (failure modes revisited under future work below).
Efficiency and Resource Dynamics
Resource consumption analysis indicates that efficiency is non-monotonic in model price: weaker models (e.g., GPT-5-nano) take up to 4× more cycles and wall-clock time due to missteps, effectively negating their lower per-token costs. Failed runs are 2.77× more expensive and 4.07× longer than successful ones, further strengthening the case for robust agentic scaffolding.
Figure 4: Resource consumption as a function of agent and LLM backend—Analysis yields lower mean cost and wall-clock time due to higher success rates.
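To see why higher success rates dominate per-token price, consider the expected spend per verified success implied by the reported 2.77× failure-cost multiplier. The $0.30 base cost below is an illustrative assumption, not a figure from the paper:

```python
def expected_cost_per_success(p_success, cost_success, fail_multiplier=2.77):
    """Expected total spend per verified success: each attempt costs
    cost_success if it succeeds, fail_multiplier * cost_success if not,
    and on average 1/p_success attempts are needed per success."""
    cost_fail = fail_multiplier * cost_success
    mean_attempt_cost = p_success * cost_success + (1 - p_success) * cost_fail
    return mean_attempt_cost / p_success

# Illustrative: a $0.30 successful run at 94% vs. 77% success rates.
print(expected_cost_per_success(0.94, 0.30))  # ~ $0.35 per verified success
print(expected_cost_per_success(0.77, 0.30))  # ~ $0.55 per verified success
```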
Qualitative Insights and Output Depth
Despite constrained dynamic analysis budgets (30–180s per dynamic tool run), successful agent runs produce diverse, correct outputs ranging from nontrivial coverage and test cases (AFL++, KLEE) to full call graphs (WALA) and structured static bug reports (CSA, Infer). However, for dynamic tools, coverage remains shallow due to runtime limits—improvements in tool invocation robustness could enable scaling to deeper analysis in automated workflows.
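For dynamic tools, the run-until-killed pattern means hitting the time cap is the expected exit path, and the artifacts left on disk are the evidence. A minimal sketch of such a capped invocation follows; the AFL++ command line is shown only as a typical example, not the paper's actual invocation:

```python
import subprocess

def run_dynamic_tool(cmd, timeout_s=180):
    """Run a dynamic analysis under the benchmark's 30-180s per-run cap."""
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout_s)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # Fuzzers like AFL++ run until killed, so the timeout is the normal
        # exit path; the on-disk corpus/coverage is the verifiable evidence.
        return None, ""

# e.g. run_dynamic_tool("afl-fuzz -i seeds -o out -- ./target @@", 180)
```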
Implications and Future Directions
Practical Implications: Analysis demonstrates that, with robust agentic scaffolding, LLM-powered automation can reliably orchestrate complex tool-plus-project tasks, rendering CI/CD integration viable at under $1 per task for top-performing models. However, evidence validation remains essential, as LLMs tend to overreport superficial success in the absence of structured checks.
Theoretical and Methodological Implications: The findings reinforce that general-purpose LLM architectures, even at scale, underperform on tasks that demand agentic planning, staged execution, and domain-tailored validation unless given task-specific structure. This supports a research direction favoring compositional, modular agent designs over pure prompt scaling for environment-intensive tasks.
Avenues for Future Work:
- Expansion of the Analysis benchmark to include additional languages, non-command-line/library targets, and commercial/closed-source analysis tools to further stress-test agent generality and portability.
- Development of adaptive agent strategies capable of early detection of failure modes (e.g., compile-fix loops, validation failures) and dynamic budget reallocation.
- Controlled studies benchmarking agent-assisted versus manual setup to empirically quantify developer productivity impact.
- Integration of self-reflective pipelines and lifelong learning to mitigate repeated errors in long-horizon action sequences.
Conclusion
The work establishes a clear methodology for evaluating, comparing, and improving LLM-agent architectures in the context of automated software analysis. It demonstrates that purpose-built, scaffolded agents with robust, evidence-based validation can solve multi-stage, environment-intensive tasks at near-human reliability, whereas unstructured prompting or model scaling alone is insufficient. This provides a blueprint for practical deployment of autonomous agents in software engineering and suggests fertile ground for further research on agent design principles in complex automation domains (arXiv:2604.11270).