- The paper introduces PRIMG, a framework that combines ML-based mutant prioritization with iterative LLM-driven test generation to improve test effectiveness for Solidity contracts.
- The mutation prioritization module uses Test Completeness Advancement Probability (TCAP) to target high-impact mutants, reducing redundancy and computational overhead.
- Experimental results on real-world Solidity projects show higher mutation scores achieved with compact test suites.
Efficient LLM-driven Test Generation: The PRIMG Approach
Introduction
The paper presents PRIMG (Prioritization and Refinement Integrated Mutation-driven Generation), a framework designed to make test case generation for Solidity smart contracts more efficient. It addresses the computational overhead and redundancy associated with mutation testing by combining mutant prioritization with large language models (LLMs) that generate and refine test cases.
PRIMG Framework Components
PRIMG integrates two core modules: mutation prioritization and test case generation. The mutation prioritization module predicts the utility of surviving mutants using machine learning models trained on mutant subsumption graphs, enabling developers to target high-impact mutants. The test case generation module uses LLMs to produce tests and then iteratively refines them until they are syntactically and behaviorally correct.
Mutation Prioritization Module
The module employs Test Completeness Advancement Probability (TCAP) to prioritize mutants based on their potential to reveal additional errors when targeted. TCAP leverages the structural information of Dynamic Mutant Subsumption Graphs (DMSGs) to predict a mutant's usefulness, enhancing test effectiveness and minimizing test suite redundancy.
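To make the subsumption idea concrete, the following is a minimal sketch of how dynamic subsumption can be derived from test results. It uses the standard definition (mutant A dynamically subsumes mutant B if A is killed by at least one test and every test that kills A also kills B) and is not the paper's TCAP predictor. The kill sets and the names `subsumes`, `subsumption_graph`, and `dominator_mutants` are illustrative assumptions.

```python
from typing import Dict, Set

KillSets = Dict[str, Set[str]]  # mutant id -> ids of the tests that kill it


def subsumes(kills_a: Set[str], kills_b: Set[str]) -> bool:
    """A dynamically subsumes B: A is killed, and every test killing A kills B."""
    return bool(kills_a) and kills_a <= kills_b


def subsumption_graph(kill_sets: KillSets) -> Dict[str, Set[str]]:
    """Map each mutant to the set of other mutants it dynamically subsumes."""
    graph: Dict[str, Set[str]] = {m: set() for m in kill_sets}
    for a, kills_a in kill_sets.items():
        for b, kills_b in kill_sets.items():
            if a != b and subsumes(kills_a, kills_b):
                graph[a].add(b)
    return graph


def dominator_mutants(kill_sets: KillSets) -> Set[str]:
    """Killed mutants that no other killed mutant strictly subsumes."""
    graph = subsumption_graph(kill_sets)
    killed = {m for m, kills in kill_sets.items() if kills}
    dominators = set()
    for m in killed:
        strictly_subsumed = any(
            m in graph[other] and other not in graph[m]
            for other in killed
            if other != m
        )
        if not strictly_subsumed:
            dominators.add(m)
    return dominators


# Hypothetical kill data: m4 survives every test.
example_kills: KillSets = {
    "m1": {"t1"},
    "m2": {"t1", "t2"},
    "m3": {"t2", "t3"},
    "m4": set(),
}
print(sorted(dominator_mutants(example_kills)))  # ['m1', 'm3']
```

Here any test that kills `m1` also kills `m2`, so targeting `m2` separately adds little. A prioritization scheme built on this structure would favor mutants such as `m1` and `m3`.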
Test Case Generation Module
The generation process begins by creating an initial prompt that includes the Program Under Test (PUT), the mutant code, and a reference test file with diverse testing patterns. This prompt guides the LLM in generating an initial unit test, which is then refined through a syntax and behavior verification loop to ensure correctness.
Figure 1: The proposed approach for generating a test case using LLMs.
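The generate-then-refine loop described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `generate`, `check_syntax`, and `check_behavior` are hypothetical callables standing in for the LLM call, the compiler check, and the behavior check, and the prompt layout is invented for the example. The default of five iterations follows the iteration count reported in the evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""  # error message to feed back to the LLM on failure


def build_prompt(put_source: str, mutant_source: str, reference_test: str) -> str:
    """Assemble the initial prompt from the three inputs named in the text."""
    return "\n\n".join([
        "Program under test:\n" + put_source,
        "Mutant to kill:\n" + mutant_source,
        "Reference test file:\n" + reference_test,
        "Write one unit test that passes on the program and fails on the mutant.",
    ])


def verify(test: str,
           check_syntax: Callable[[str], Verdict],
           check_behavior: Callable[[str], Verdict]) -> Verdict:
    """Run the syntax check first; only run the behavior check if it passes."""
    syntax = check_syntax(test)
    if not syntax.ok:
        return syntax
    return check_behavior(test)


def refine_test(prompt: str,
                generate: Callable[[str], str],
                check_syntax: Callable[[str], Verdict],
                check_behavior: Callable[[str], Verdict],
                max_iterations: int = 5) -> Optional[str]:
    """Return a verified test, or None if refinement does not converge."""
    test = generate(prompt)
    for _ in range(max_iterations):
        verdict = verify(test, check_syntax, check_behavior)
        if verdict.ok:
            return test
        # Hypothetical repair prompt: original context plus the failure message.
        repair_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{test}\n\n"
            f"Feedback:\n{verdict.feedback}\n\nReturn a corrected test."
        )
        test = generate(repair_prompt)
    # Check the test produced by the last repair attempt.
    final = verify(test, check_syntax, check_behavior)
    return test if final.ok else None
```

Passing the checks in as callables keeps the loop independent of any particular compiler or test runner, so it can be exercised with stubs before wiring in a real toolchain.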
Experimental Evaluation
PRIMG was evaluated on real-world Solidity projects from Code4rena, where it improved mutation scores while keeping the test suite compact. Its effectiveness in reducing computational overhead and generating high-quality tests was assessed using the following criteria:
- Test Case Correctness: The refinement process significantly increased the syntactic and behavioral correctness rates of generated tests compared to single-shot prompts. The paper found a marked improvement with a refinement loop of five iterations, with no substantial gain observed beyond this point.
- Performance of Prioritization Module: The prioritization module outperformed random mutant selection by consistently targeting high-impact mutants, resulting in an increased number of killed mutants and reducing redundant test efforts.
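The comparison between prioritized and random mutant selection can be illustrated with a short sketch. The mutation score below is the usual ratio of killed mutants to mutants considered. The predicted scores and the helper names `select_prioritized` and `select_random` are hypothetical, and nothing here reproduces the paper's measurements.

```python
import random
from typing import Dict, Iterable, List


def mutation_score(killed: int, total: int) -> float:
    """Fraction of the considered mutants that the test suite kills."""
    if total <= 0:
        raise ValueError("total must be positive")
    return killed / total


def select_prioritized(predicted: Dict[str, float], k: int) -> List[str]:
    """Pick the k mutants with the highest predicted utility (ties broken by id)."""
    return sorted(predicted, key=lambda m: (-predicted[m], m))[:k]


def select_random(mutants: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Baseline: pick k mutants uniformly at random, reproducibly."""
    pool = sorted(mutants)
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Hypothetical predicted utilities for four surviving mutants.
predicted_utility = {"m1": 0.20, "m2": 0.90, "m3": 0.50, "m4": 0.05}
targets = select_prioritized(predicted_utility, k=2)
baseline = select_random(predicted_utility, k=2)
```

Under the same budget `k`, tests are generated for `targets` and for `baseline`, and the resulting mutation scores are compared.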
Figure 2: Overview of the dataset labeling process.
Implementation Details
The LLM used in this framework is a fine-tuned version of Llama 3.1 configured for test case generation and refinement. Mutation testing used SuMo, a dedicated mutation testing tool for Solidity smart contracts, to generate and test mutants across various operators. Test environments were set up with the Hardhat framework, and Ganache was used to deploy and execute contracts.
Developers applying this framework should verify that the testing and deployment environments are set up correctly, and should consider training the machine learning models on project-specific data to improve the precision of mutation prioritization.
Conclusion
PRIMG integrates ML-based mutant prioritization with automated LLM-driven test generation and refinement. In the reported experiments, this combination produced higher mutation scores with compact test suites while reducing the computational overhead of mutation testing, which in turn improves the ability of the resulting tests to detect faults.
Figure 3: Number of test cases by trial and projects.