Feedback-Driven Execution for LLM-Based Binary Analysis

Published 16 Apr 2026 in cs.CR | (2604.15136v1)

Abstract: Binary analysis increasingly relies on LLMs to perform semantic reasoning over complex program behaviors. However, existing approaches largely adopt a one-pass execution paradigm, where reasoning operates over a fixed program representation constructed by static analysis tools. This formulation limits the ability to adapt exploration based on intermediate results and makes it difficult to sustain long-horizon, multi-path analysis under constrained context. We present FORGE, a system that rethinks LLM-based analysis as a feedback-driven execution process. FORGE interleaves reasoning and tool interaction through a reasoning-action-observation loop, enabling incremental exploration and evidence construction. To address the instability of long-horizon reasoning, we introduce a Dynamic Forest of Agents (FoA), a decomposed execution model that dynamically coordinates parallel exploration while bounding per-agent context. We evaluate FORGE on 3,457 real-world firmware binaries. FORGE identifies 1,274 vulnerabilities across 591 unique binaries, achieving 72.3% precision while covering a broader range of vulnerability types than prior approaches. These results demonstrate that structuring LLM-based analysis as a decomposed, feedback-driven execution system enables both scalable reasoning and high-quality outcomes in long-horizon tasks.

Authors (3)

Summary

The paper introduces a feedback-driven execution model that interleaves LLM reasoning, tool interactions, and evidence aggregation for improved binary analysis.
It employs the Dynamic Forest of Agents, hierarchically decomposing tasks to manage context, mitigate search explosion, and validate vulnerabilities.
Empirical results on 3,457 firmware binaries show 72.3% precision and a roughly 2.5x lower cost per verified vulnerability than traditional methods.

Motivation and Limitations of One-Pass Paradigms

Traditional binary analysis systems rely primarily on one-pass execution models: they construct a static program representation through offline disassembly, CFG recovery, and symbolic analysis, and vulnerability detection then operates on top of that fixed structure. This limits adaptability and interactivity. The paradigm is a particularly poor fit for binaries, which lack high-level metadata and force analysts to reconstruct semantics iteratively from low-level artifacts. Existing LLM-based methods have inherited this monolithic approach and do not interleave reasoning with analysis dynamically.

Figure 1: Execution model comparison: monolithic one-pass execution vs. iterative feedback-driven cycles in binary analysis.

Feedback-Driven Reasoning–Action–Observation Loops

The paper introduces a feedback-driven execution paradigm, wherein LLM agents iteratively reason, act on the binary via tool invocation, observe results, and revise strategies. This closed loop facilitates incremental semantic reconstruction and adaptive exploration. By leveraging this interactive model, agents can evolve hypotheses, form evidence chains, and dynamically select analysis subgoals, overcoming the static limitations of one-pass pipelines.
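The loop described above can be sketched in a few lines. This is a minimal illustration, not FORGE's implementation: the tool names (`list_imports`, `decompile`), their canned outputs, and the `scripted_reasoner` that stands in for the LLM are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: names and canned outputs are stand-ins, not FORGE's real tools.
TOOLS: Dict[str, Callable[[str], str]] = {
    "list_imports": lambda arg: "recv, strcpy, system",
    "decompile": lambda arg: f"void {arg}() {{ recv(s, buf, 512, 0); strcpy(dst, buf); }}",
}

@dataclass
class AgentState:
    goal: str
    # Each observation is (tool, argument, output).
    observations: List[Tuple[str, str, str]] = field(default_factory=list)

def scripted_reasoner(state: AgentState) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM: choose the next action from what has been observed so far."""
    seen = {tool for tool, _, _ in state.observations}
    if "list_imports" not in seen:
        return ("list_imports", "")
    if "decompile" not in seen:
        # Feedback: this step is chosen only because of the previous observation.
        imports = state.observations[-1][2]
        if "strcpy" in imports:
            return ("decompile", "handle_request")
    return None  # nothing left to explore, so stop

def run_loop(goal: str,
             reasoner: Callable[[AgentState], Optional[Tuple[str, str]]],
             max_steps: int = 8) -> AgentState:
    state = AgentState(goal)
    for _ in range(max_steps):
        action = reasoner(state)                          # reason
        if action is None:
            break
        tool, arg = action
        output = TOOLS[tool](arg)                         # act
        state.observations.append((tool, arg, output))    # observe
    return state

state = run_loop("find unsafe copies of network input", scripted_reasoner)
for tool, arg, output in state.observations:
    print(f"{tool}({arg!r}) -> {output}")
```

The point of the sketch is the control flow: the second tool call is not planned up front but selected from the result of the first, which is what distinguishes this loop from a one-pass pipeline.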

Scalability Challenges: Reasoning Depth and Breadth

Scaling iterative LLM-driven binary analysis requires structural solutions along two axes: reasoning depth (long chains of dependent steps) and breadth (simultaneous multi-path exploration). As depth increases, LLM performance degrades because of context collapse, attention drift, and error accumulation.

Figure 2: Long analysis chains induce context degradation and cumulative errors, undermining deep vulnerability reasoning.
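A back-of-the-envelope calculation shows why error accumulation matters for deep chains. It rests on two simplifying assumptions that are not taken from the paper: every reasoning step is correct with the same probability p, and steps fail independently. The values p = 0.95 and n = 20 are purely illustrative.

```latex
P(\text{all } n \text{ steps correct}) = p^{n},
\qquad
p = 0.95,\; n = 20 \;\Rightarrow\; 0.95^{20} \approx 0.36
```

Under these assumptions, a step that is right 95% of the time still yields a fully correct 20-step chain only about a third of the time, which is one way to read the motivation for keeping each agent's chain short.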

Breadth demands concurrent exploration of complex source-sink combinations, risking context overload and selection bias.

Figure 3: Simultaneous multi-path analysis leads to context overload and difficulties in selection, especially in large binaries.

The FoA Model: Dynamic Forest of Agents

The central innovation is the Dynamic Forest of Agents (FoA) model. FoA decomposes the exploration process hierarchically: each tree is rooted at a taint source, nodes represent agents assigned local sub-tasks, and edges encode task delegation.

Figure 4: FoA execution structure; trees rooted at taint sources, agent nodes, and task decomposition edges.

FoA adapts dynamically: agents are instantiated on demand for recursive subproblems, which bounds per-agent context and keeps reasoning local. At runtime, execution alternates between forward expansion (task decomposition and agent generation) and backward aggregation (structured propagation of evidence), yielding verifiable execution traces.

Figure 5: Dynamic agent generation process: recursive delegation of subtasks and runtime instantiation of agent nodes.
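The forest structure and its two phases can be illustrated with a small sketch. Everything concrete here is hypothetical: the `CALL_GRAPH`, the `LOCAL_FINDINGS` table standing in for what an agent would conclude about one function, and the rule of one child agent per callee. The paper's actual delegation policy is decided by the LLM at runtime.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical call graph: function -> callees that receive the tainted data.
CALL_GRAPH: Dict[str, List[str]] = {
    "recv_handler": ["parse_header", "log_request"],
    "parse_header": ["copy_field"],
    "log_request": [],
    "copy_field": [],
}
# Hypothetical local findings an agent might report after inspecting one function.
LOCAL_FINDINGS: Dict[str, str] = {"copy_field": "strcpy into 64-byte stack buffer"}

@dataclass
class AgentNode:
    task: str    # the function this agent is responsible for
    depth: int
    children: List["AgentNode"] = field(default_factory=list)

def expand(node: AgentNode, max_depth: int = 4) -> None:
    """Forward expansion: create one child agent per delegated subtask, on demand."""
    if node.depth >= max_depth:
        return
    for callee in CALL_GRAPH.get(node.task, []):
        child = AgentNode(task=callee, depth=node.depth + 1)
        node.children.append(child)
        expand(child, max_depth)

def aggregate(node: AgentNode) -> List[List[str]]:
    """Backward aggregation: return evidence chains, each prefixed with this node's task."""
    chains: List[List[str]] = []
    if node.task in LOCAL_FINDINGS:
        chains.append([f"{node.task}: {LOCAL_FINDINGS[node.task]}"])
    for child in node.children:
        for chain in aggregate(child):
            chains.append([node.task] + chain)
    return chains

# One tree per taint source; this toy forest has a single root.
forest = [AgentNode(task="recv_handler", depth=0)]
for root in forest:
    expand(root)
    for chain in aggregate(root):
        print(" -> ".join(chain))
```

Each node only ever looks at its own task and the chains its children hand back, which is the sense in which per-agent context stays bounded even as the tree grows.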

Discovery–Validation Integration and Evidence Chains

FoA enables unified workflows for discovery and validation. During discovery, agents traverse taint sources, semantically trace data flows to sinks, and construct provenance chains. These intermediate artifacts are then replayed in evidence-constrained validation runs, ensuring that only actionable vulnerabilities are reported. The evidence chains produced encode fine-grained, stepwise propagation, essential for reproducible exploit verification.

Figure 6: Two-stage FoA workflow for vulnerability discovery and subsequent validation.
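A simplified way to picture evidence-constrained validation is as a replay of each candidate's chain against facts that can be re-derived from the binary. The sketch below assumes a chain is a list of source-to-destination flow steps and that validation means "every step is confirmed and the steps connect". The `Step` and `Candidate` types and the `CONFIRMED_FLOWS` set are hypothetical and much cruder than what the paper describes.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass(frozen=True)
class Step:
    src: str   # where the tainted value comes from
    dst: str   # where it flows to

@dataclass
class Candidate:
    name: str
    chain: List[Step]

# Hypothetical facts a validation run could re-derive from the binary with tools.
CONFIRMED_FLOWS: Set[Tuple[str, str]] = {
    ("recv", "buf"),
    ("buf", "strcpy.src"),
    ("getenv", "path"),
}

def validate(candidate: Candidate, confirmed: Set[Tuple[str, str]]) -> bool:
    """Pass only if the chain is non-empty, contiguous, and every step is confirmed."""
    if not candidate.chain:
        return False
    for prev, nxt in zip(candidate.chain, candidate.chain[1:]):
        if prev.dst != nxt.src:
            return False  # gap in the provenance chain
    return all((s.src, s.dst) in confirmed for s in candidate.chain)

candidates = [
    Candidate("stack overflow in copy_field",
              [Step("recv", "buf"), Step("buf", "strcpy.src")]),
    Candidate("command injection via PATH",
              [Step("getenv", "path"), Step("path", "system.arg")]),
]
reported = [c.name for c in candidates if validate(c, CONFIRMED_FLOWS)]
print(reported)
```

Here the second candidate is dropped because its final step cannot be confirmed, mirroring how unverifiable discoveries are filtered before reporting.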

Mechanism Analysis and Failure Mode Mitigation

FoA directly addresses three major failure modes in long-horizon binary analysis:

Context Collapse: Through decomposition and bounded local context, FoA prevents deterioration of extended reasoning chains.
Search Explosion: LLM-guided semantic pruning limits exploration to plausible, evidence-driven branches, avoiding wasteful expansion.
Unverified Alerts: Integrated discovery–validation pipelines systematically filter unverifiable candidates, improving actionable yield.

Figure 7: FoA mechanisms mitigate context collapse, search explosion, and lack of systematic validation.

The ablation experiments substantiate these claims: single-agent and sequential-only variants suffer sharp drops in verified vulnerability rates compared to full FoA.

Empirical Evaluation and Numerical Results

On 3,457 real-world firmware binaries, FORGE (the system built on FoA) achieved:

1,274 vulnerabilities in 591 unique binaries
72.3% precision (validated manually across representative samples)
Broader vulnerability type coverage (6+ CWE categories vs. 2 in traditional tools)

FORGE also showed higher throughput and robustness than the baselines: its verified vulnerability counts exceeded those of Mango and SaTC, particularly on binaries that traditional methods could not analyze because of path explosion or memory exhaustion. The cost per verified vulnerability improved by a factor of roughly 2.5, in both time and tokens. The scalability analysis showed long-tailed resource requirements that correlate strongly with reasoning depth rather than with agent count.

Figure 8: CDFs of reasoning steps and agent counts per binary; resource usage driven by path complexity.
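The headline figures can be put in proportion with some simple arithmetic. The inputs are the numbers reported above; the derived quantities are not stated in the paper. In particular, the estimated true-positive count assumes that the 72.3% precision measured on manually validated samples carries over to all 1,274 reports.

```python
total_binaries = 3457     # firmware binaries analyzed
flagged_binaries = 591    # unique binaries with at least one reported vulnerability
reported = 1274           # vulnerabilities reported
precision = 0.723         # measured on manually validated samples

share_flagged = flagged_binaries / total_binaries
per_flagged = reported / flagged_binaries
# Assumption: sampled precision is extrapolated to every report.
est_true_positives = reported * precision

print(f"share of binaries flagged:     {share_flagged:.1%}")
print(f"reports per flagged binary:    {per_flagged:.2f}")
print(f"estimated true positives:      {est_true_positives:.0f}")
```

Roughly one binary in six is flagged, with a little over two reports per flagged binary on average.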

Implications and Future Directions

FoA’s results underscore that execution structure, not merely LLM quality, is a primary determinant of scalable, robust binary analysis. The system’s compositional agent hierarchy, dynamic task delegation, and evidence-driven reasoning constitute a cohesive solution for sustaining long-horizon, multi-path analysis under constrained context. Practically, this architecture enables automated vulnerability discovery and verification on large-scale firmware datasets with reduced manual burden. Theoretically, it introduces a paradigm for agentic reasoning under partial observability, amenable to further extensions—e.g., hybrid symbolic-LLM analysis, adaptive exploration strategies, and integration of fuzzing or formal verification tools.

Conclusion

The paper establishes a principled feedback-driven execution model for LLM-based binary analysis. By tightly interleaving reasoning, tool interaction, and evidence aggregation within a Dynamic Forest of Agents, it overcomes structural bottlenecks of one-pass paradigms, enables scalable exploration, and achieves high validated vulnerability yields across diverse real-world firmware binaries. The FoA architecture provides a robust foundation for future research in automatic code analysis, agentic reasoning, and scalable binary vulnerability detection (2604.15136).
