LLAMA: Multi-Feedback Smart Contract Fuzzing

Updated 10 September 2025
  • The paper introduces a hybrid fuzzing framework combining LLM-guided seed generation, multi-objective feedback, and evolutionary mutation to uncover deep smart contract vulnerabilities.
  • It employs a multi-stage hierarchical prompting strategy with dynamic mutation scheduling, achieving high branch and instruction coverage with an 89% vulnerability detection rate.
  • Integration with symbolic execution and adaptive feedback allows LLAMA to effectively target complex, state-dependent flaws in blockchain systems.

A multi-feedback smart contract fuzzing framework, exemplified by LLAMA (Gai et al., 16 Jul 2025), is an advanced security testing infrastructure tailored for the intricate demands of blockchain ecosystems. At its core, LLAMA integrates LLMs, multi-objective runtime feedback, and evolutionary search to systematically uncover both shallow and deep-seated vulnerabilities in smart contract code. The architecture is characterized by tightly coupled modules for seed synthesis, feedback-driven mutation, and adaptive scheduling—each continuously informed by diverse metrics such as semantic coverage, data dependency, abnormal control flow, and runtime exception signals. LLAMA's hybrid design addresses gaps left by previous frameworks that focused narrowly on seed scheduling or singular feedback signals, thereby providing a practical and scalable platform for modern blockchain security assurance.

1. Hierarchical LLM-Guided Seed Generation

LLAMA introduces a five-stage hierarchical prompting strategy for guiding LLMs to generate high-quality initial seeds. The process begins with functional abstraction, where the LLM analyzes the contract's ABI, bytecode, and documentation to summarize function intent, control modifiers, and data dependencies. In the transaction sequence inference step, the model proposes candidate call sequences that are consistent with inter- and intra-contract dataflow. These synthesized inputs are submitted to format verification against the ABI, filtering out infeasible or malformed cases. The LLM's prompting apparatus is then semantically optimized to bias seed generation towards sequences interacting with vulnerability-prone patterns (e.g., reentrancy, overflow). A defining feature is behavior-guided prompt injection: After each round of execution, runtime feedback (e.g., coverage plateau, lack of state transitions) is translated into natural language hints that augment the base prompt, ensuring that subsequent LLM outputs are progressively more focused on unexplored or semantically complex state spaces.

This staged prompt-and-filter approach allows the seed corpus to be both syntactically and semantically rich, outperforming random or purely grammar-based seed construction, and is essential for early penetration of deeper vulnerability logic in highly stateful contracts.
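The staged prompting loop described above can be sketched as follows. The paper does not publish an API, so the helper names (`build_prompt`, `feedback_to_hints`) and the section wording are illustrative; any chat-completion call would stand in for the LLM.

```python
def build_prompt(abi, bytecode, docs, feedback_hints):
    """Assemble the staged prompt: functional abstraction, transaction
    sequence inference, semantic optimization, and (when available)
    behavior-guided hints from the previous fuzzing round."""
    sections = [
        "## Functional abstraction\n"
        f"ABI: {abi}\nBytecode: {bytecode}\nDocs: {docs}\n"
        "Summarize each function's intent, modifiers, and data dependencies.",
        "## Transaction sequence inference\n"
        "Propose call sequences consistent with inter-/intra-contract dataflow.",
        "## Semantic optimization\n"
        "Bias sequences toward vulnerability-prone patterns "
        "(reentrancy, integer overflow).",
    ]
    if feedback_hints:  # behavior-guided prompt injection
        sections.append("## Runtime feedback\n" + "\n".join(feedback_hints))
    return "\n\n".join(sections)

def feedback_to_hints(stats):
    """Translate runtime metrics into natural-language hints for the
    next prompting round (thresholds are illustrative)."""
    hints = []
    if stats.get("coverage_delta", 1.0) < 0.01:
        hints.append("Coverage has plateaued; target unexplored branches.")
    if not stats.get("state_transitions"):
        hints.append("No new storage writes observed; vary call ordering.")
    return hints
```

Generated sequences would then pass through ABI format verification before entering the seed corpus, per the pipeline above.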

2. Multi-Feedback Optimization and Scheduling

LLAMA applies a tightly integrated, multi-channel feedback mechanism to every phase of the fuzzing campaign—seed selection, seed evolution, and mutation scheduling. During pre-fuzzing selection, each LLM-generated seed is quickly executed to measure coarse metrics such as instruction and branch coverage, as well as exception occurrence. Seeds are then scored and filtered via a dynamic Top-K strategy:

$$\text{Score}(s_i) = \text{Coverage}(s_i) + \lambda \cdot \text{Exception}(s_i)$$

where $\lambda$ boosts seeds that trigger abnormal executions.
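A minimal sketch of this Top-K filter, assuming coverage and exception counts come from a quick concrete execution of each seed (the weight `lam` is a tunable hyperparameter, not a value from the paper):

```python
def score_seed(coverage, exceptions, lam=2.0):
    # Score(s_i) = Coverage(s_i) + lambda * Exception(s_i)
    return coverage + lam * exceptions

def top_k_seeds(seeds, k):
    """seeds: list of (seed, coverage, exception_count) tuples.
    Returns the k highest-scoring seeds for the initial corpus."""
    ranked = sorted(seeds, key=lambda s: score_seed(s[1], s[2]), reverse=True)
    return [s[0] for s in ranked[:k]]
```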

As fuzzing progresses, a composite multi-objective fitness function is applied to each seed $i$:

$$\text{fit}(i) = \Delta_\text{branch}(i) + \Delta_\text{inst}(i) + \Delta_\text{RAW}(i)$$

with each term measuring gains in new branch hits, new instruction execution, and read-after-write (RAW) dependency coverage.
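Each $\Delta$ term counts previously-unseen items the seed reached. A sketch under that reading, with the global sets tracking everything the campaign has covered so far (the set representation is illustrative):

```python
def fitness(seed_trace, global_branches, global_insts, global_raw):
    """Composite fitness: gains in branch, instruction, and
    read-after-write (RAW) dependency coverage for one seed."""
    d_branch = len(seed_trace["branches"] - global_branches)
    d_inst = len(seed_trace["insts"] - global_insts)
    d_raw = len(seed_trace["raw_pairs"] - global_raw)
    return d_branch + d_inst + d_raw
```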

Beyond seed fitness, LLAMA features dynamic mutation scheduling. Mutation operators are selected using a proportional credit system. Each operator $j$'s effective contribution to new coverage is updated as:

$$\text{fit}(j) \mathrel{+}= \frac{1}{|\mathcal{J}|}\big(\Delta_\text{branch}(i') + \Delta_\text{inst}(i')\big)$$

where $i'$ is the offspring produced and $\mathcal{J}$ the set of operators applied to it; operator selection probabilities are then perturbed according to this observed effectiveness.
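A sketch of this credit system: every operator that contributed to offspring $i'$ shares the coverage gain equally, and selection probabilities are renormalized from the accumulated credits (the smoothing term `eps` is an assumption to keep zero-credit operators selectable):

```python
def update_credits(credits, applied_ops, d_branch, d_inst):
    """Split the offspring's coverage gain equally among the
    operators that produced it."""
    share = (d_branch + d_inst) / len(applied_ops)
    for op in applied_ops:
        credits[op] = credits.get(op, 0.0) + share
    return credits

def operator_probs(credits, eps=1e-3):
    """Turn credits into a sampling distribution over operators."""
    total = sum(credits.values()) + eps * len(credits)
    return {op: (c + eps) / total for op, c in credits.items()}
```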

By aggregating multiple feedback metrics (coverage, vulnerability signals, dependency depth, and runtime exceptions), LLAMA ensures that mutation energy is neither wasted on low-yield branches nor disproportionately focused on trivial paths. Feedback influences are closed-loop: feedback from seed execution not only informs mutation scheduling but is also injected back into LLM prompt cycles, achieving dynamic and adaptive exploration.

3. Evolutionary and Hybrid Fuzzing Engine

The fuzzing engine in LLAMA is a genetic algorithm combined with selective symbolic execution. Candidate transaction sequences (seeds) are evolved across generations:

  • Selection: Multi-objective fitness determines which seeds are retained ($\gamma$-fraction elitism).
  • Crossover: Dependency-aware crossover ensures that offspring respect semantic relationships and data dependencies from both parents.
  • Adaptive Mutation: Operator probabilities are dynamically updated to focus on strategies yielding new behavior.

When progress stalls (coverage growth below 1%), symbolic execution is invoked. Path constraints are extracted, solved, and resulting inputs are re-injected into the population, breaking through deep or solver-resistant branch guards.

This hybrid design allows LLAMA to efficiently exploit structural regularities where possible (via heuristics and LLM-synthesized seeds), while selectively deploying heavier-weight symbolic reasoning only when necessary, thus managing computational resources and maximizing contract state-space exploration.
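The hybrid loop can be sketched at a high level as follows. All callables (`fitness_fn`, `crossover`, `mutate`, `coverage_fn`, `solve_stalled`) are placeholders for the paper's components; the elitism fraction and stall threshold are parameters, with the 1% stagnation trigger taken from the text:

```python
def evolve(population, fitness_fn, crossover, mutate, coverage_fn,
           solve_stalled, generations=100, elite_frac=0.25,
           stall_threshold=0.01):
    """Elitist genetic loop with a symbolic-execution fallback on stagnation."""
    prev_cov = None
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        elites = ranked[:max(1, int(elite_frac * len(ranked)))]
        # Refill the population via dependency-aware crossover + mutation.
        offspring = [mutate(crossover(elites[i % len(elites)],
                                      elites[(i + 1) % len(elites)]))
                     for i in range(len(population) - len(elites))]
        population = elites + offspring
        cov = coverage_fn(population)
        if prev_cov is not None and cov - prev_cov < stall_threshold * prev_cov:
            # Coverage growth < threshold: solve path constraints and
            # re-inject the resulting inputs.
            population = population + solve_stalled(population)
        prev_cov = cov
    return population
```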

4. Empirical Performance and Comparative Evaluation

LLAMA has been evaluated on corpora with thousands of contracts and hundreds of known vulnerabilities. Experimental results show:

  • Coverage: 91% branch coverage and 90% instruction coverage on small contracts; for large contracts, coverage is lower but remains within 10–11 percentage points of these figures.
  • Vulnerability Detection: Out of 148 annotated vulnerabilities, LLAMA detected 132 (89% detection rate), significantly outperforming ConFuzzius (71%) and other state-of-the-art fuzzers.
  • Efficiency: Coverage-over-time plots indicate LLAMA converges to high coverage much faster due to its LLM-guided seed pool and adaptively tuned mutators. While resource utilization is increased by LLM and online feedback, overall overhead remains acceptable, with symbolic execution gated by stagnation criteria.

LLAMA's multi-feedback and hierarchical structure directly address documented inefficiencies in traditional frameworks—particularly those related to random seed mutation, input coverage limitations, and inability to escape local maxima in the contract's state-transition graph.

5. Integration with LLMs and Structured Data Mutation

LLAMA's design is influenced by emerging methods in LLM-guided mutation and structured input handling (Zhang et al., 11 Jun 2024). By treating transaction sequences and ABI-encoded calldata as “structured data”, the framework leverages paired seed fine-tuning, hex-encoding representations, and prompt-based mutation to generate format-preserving, high-diversity seeds. This reduces invalid executions and enables the discovery of vulnerabilities requiring complex interaction patterns or state manipulations that are intractable for traditional bit- or grammar-level mutators.
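As one concrete reading of this "structured data" treatment, a format-preserving calldata mutator keeps the 4-byte function selector intact and perturbs only the 32-byte ABI-encoded argument slots, so mutated inputs still decode under the ABI. The slot layout follows the standard Solidity ABI; the single-bit-flip policy is illustrative, not the paper's exact operator set:

```python
import random

def mutate_calldata(calldata_hex, rng=random):
    """Flip one bit in a random 32-byte argument word, leaving the
    4-byte (8 hex char) function selector untouched."""
    selector, args = calldata_hex[:8], calldata_hex[8:]
    words = [args[i:i + 64] for i in range(0, len(args), 64)]
    if words:
        idx = rng.randrange(len(words))
        value = int(words[idx] or "0", 16)
        value ^= 1 << rng.randrange(256)  # perturb one bit in the slot
        words[idx] = f"{value & (2**256 - 1):064x}"
    return selector + "".join(words)
```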

The inclusion of asynchronous LLM integration—where LLM suggestions are generated and injected in parallel—ensures that the fuzzing campaign maintains throughput while benefiting from periodic semantic “reshaping” of the test input space.
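One way to realize this asynchrony, sketched with Python's standard threading primitives: LLM suggestions are produced on a background thread and drained into the corpus between fuzzing rounds, so the main loop never blocks on model latency. `request_seeds` is a stand-in for the actual LLM call; the queue-based handoff is an assumption, not the paper's stated mechanism.

```python
import queue
import threading

def start_llm_worker(request_seeds, out_queue, rounds):
    """Run `rounds` LLM requests in the background, pushing each
    suggested seed onto `out_queue` as it arrives."""
    def worker():
        for _ in range(rounds):
            for seed in request_seeds():
                out_queue.put(seed)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

def drain_new_seeds(out_queue, corpus):
    """Called between fuzzing rounds: absorb whatever the LLM has
    produced so far without blocking."""
    while True:
        try:
            corpus.append(out_queue.get_nowait())
        except queue.Empty:
            return corpus
```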

6. Real-World Applications and Implications for Blockchain Security

The LLAMA framework is engineered for practical, real-world deployment in the blockchain ecosystem:

  • Pre-Deployment Auditing: The ability to generate vulnerability-sensitive, semantically meaningful testcases enables thorough pre-release auditing of high-value contracts where post-deployment patches are infeasible.
  • Continuous Security Testing: The efficiency and adaptability of the multi-feedback evolutionary engine make continuous integration and regression testing in active DApp development pipelines achievable.
  • Dynamic Risk Assessment: By integrating runtime behavior metrics (e.g., gas usage anomalies, call graph expansion, abnormal event emission), LLAMA can be used in live monitoring or risk scoring for contracts on public chains.

The approach is extensible as new vulnerability categories emerge, or as contract languages evolve, by updating LLM prompting templates and the primitive set of feedback signals. This adaptability is facilitated by the modular, feedback-centric architecture.

7. Relationship to Prior and Contemporary Research

LLAMA advances beyond prior adaptive fuzzers (e.g., sFuzz (Nguyen et al., 2020)) by generalizing the single “distance-to-branch” signal into a multidimensional feedback vector encompassing dependency and semantic coverage. By integrating symbolic execution only on stagnation, it avoids the state explosion encountered in tools such as Oyente while achieving superior coverage. Contemporary work in LLM-guided vulnerability detection (Ince et al., 12 Jul 2024, Yu et al., 9 Nov 2024, Kevin et al., 17 Feb 2025, Yu et al., 23 Jun 2025) supports the premise that fine-tuned LLMs, when combined with domain-specific feedback and optimization, provide a robust backbone for next-generation fuzzing frameworks.

Furthermore, empirical studies in the field (Wu et al., 5 Feb 2024, Qiao et al., 9 Jun 2025) establish the necessity of multi-signal feedback systems both to maximize true positive rates and to address the practical adoption barriers—such as cross-chain compatibility, usability, and configurability—that LLAMA directly targets with its layered and modular design.


In summary, the LLAMA multi-feedback smart contract fuzzing framework embodies a confluence of LLM-driven seed synthesis, runtime multi-objective optimization, hybrid genetic and symbolic exploration, and adaptive scheduling. This enables systematic, efficient, and deep vulnerability discovery in increasingly complex smart contracts, establishing a foundational paradigm for secure blockchain deployment and operation.