Analyzing Fuzzing Evaluation Practices: A Methodological Study
The paper "SoK: Prudent Evaluation Practices for Fuzzing" offers a comprehensive analysis of the evaluation practices adopted in fuzzing research papers over the span of six years (2018-2023) in leading computer security and software engineering venues. The research critically examines the adherence of these studies to methodological best practices, mainly benchmarking them against guidelines provided by Klees et al., and further proposes updated recommendations for evaluating fuzzing methodologies.
Overview of Research and Findings
Fuzzing, widely acknowledged for its efficacy in discovering software bugs, relies fundamentally on randomness, which poses particular challenges to the reproducibility and validity of experimental results. The authors survey 150 fuzzing papers and find significant deficiencies in how their evaluations are conducted. Gaps appear in statistical testing, evaluation metrics, seed selection, and fairness of resource allocation, all of which are crucial for reproducible and trustworthy results. In addition, the authors perform artifact evaluation for a subset of eight papers to assess how practical and reproducible the published claims are.
A major observation is the inconsistent use, or outright neglect, of robust statistical methods in reported evaluations: 63 papers used no statistical tests to back their claims. Only 37 of the studies employed the Mann-Whitney U-test, and 15 of those did so with too few trial repetitions, calling the robustness of their findings into question. This lack of statistical rigor directly undermines the reliability of the reported results and of the claimed improvements over existing techniques.
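To make this concrete, the sketch below shows the kind of test such evaluations are expected to report: a Mann-Whitney U-test over the final coverage of repeated trials of two fuzzers, using SciPy. The coverage numbers, trial count, and fuzzer names are hypothetical placeholders for illustration, not data from the paper.

```python
# Minimal sketch: comparing per-trial edge coverage of two fuzzers with a
# Mann-Whitney U test. All values below are made-up placeholders; a real
# evaluation would use at least 10 independent trials per fuzzer under
# identical time and resource budgets.
from scipy.stats import mannwhitneyu

# Final edge coverage per trial (hypothetical, 10 trials each).
coverage_fuzzer_a = [1412, 1398, 1450, 1377, 1421, 1403, 1436, 1390, 1415, 1428]
coverage_fuzzer_b = [1365, 1352, 1401, 1340, 1389, 1371, 1358, 1394, 1346, 1380]

# Two-sided test: do the coverage distributions differ significantly?
statistic, p_value = mannwhitneyu(coverage_fuzzer_a, coverage_fuzzer_b,
                                  alternative="two-sided")
print(f"Mann-Whitney U = {statistic}, p = {p_value:.4f}")
```

Reporting the p-value from such a test, together with the number of trials, lets readers judge whether an observed coverage difference could plausibly be due to fuzzing's inherent randomness.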
Moreover, the paper highlights a disproportionate focus on certain fuzzers, such as AFL and its derivatives, at the expense of state-of-the-art alternatives. This skew in baseline selection limits the generalizability of the results and risks overlooking the strengths of the diverse fuzzing approaches available in contemporary practice.
Another salient issue is the practice surrounding CVE claims. The analysis shows that, of the many CVEs reported across the surveyed papers, a significant number could not be verified or were disputed, raising concerns about the pressure to demonstrate real-world impact through such metrics.
Recommendations and Best Practices
To address the identified shortcomings, the paper provides an updated set of guidelines for future fuzzing research:
- Reproducibility and Artifact Sharing: Authors should ensure their research artifacts, including code and experiment configurations, are openly accessible and accompanied by thorough documentation. Participation in artifact evaluation should be encouraged to improve transparency and reproducibility.
- Benchmarking and Target Selection: The paper advises using well-recognized benchmarks and a representative set of evaluation targets that align with the specifics of the technique under assessment.
- Fair Comparison and Seed Selection: It is vital to compare against relevant state-of-the-art fuzzers and employ a transparent and equitable selection of seed sets. Uninformed seeds or multiple seed sets should be used to validate claims effectively.
- Metrics and Statistical Analysis: Fuzzing studies should employ established evaluation metrics and apply statistical tests, such as bootstrap or permutation tests, systematically and with sufficient trial repetitions to substantiate any performance claims. Effect sizes should be reported alongside significance tests (a brief sketch of such an analysis follows this list).
- Documenting Threats to Validity: Explicit attention should be given to articulating possible threats to the validity of the research findings and discussing mitigation strategies within the documentation.
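As a companion to the metrics recommendation above, here is a minimal sketch of one way to pair a significance test with an effect size: a permutation test on the difference in mean per-trial results plus the Vargha-Delaney A12 statistic. The a12 helper, the data, and the choice of test statistic are illustrative assumptions rather than prescriptions from the paper, and scipy.stats.permutation_test requires SciPy 1.9 or newer.

```python
import numpy as np
from scipy.stats import permutation_test

def a12(sample_x, sample_y):
    """Vargha-Delaney A12: probability that a random draw from X beats one from Y."""
    x = np.asarray(sample_x)[:, None]
    y = np.asarray(sample_y)[None, :]
    greater = (x > y).sum()
    ties = (x == y).sum()
    return (greater + 0.5 * ties) / (len(sample_x) * len(sample_y))

# Hypothetical per-trial results (e.g., final edge coverage), 10 trials each.
new_fuzzer = [1412, 1398, 1450, 1377, 1421, 1403, 1436, 1390, 1415, 1428]
baseline   = [1365, 1352, 1401, 1340, 1389, 1371, 1358, 1394, 1346, 1380]

# Permutation test on the difference in means (two-sided).
result = permutation_test(
    (new_fuzzer, baseline),
    statistic=lambda x, y: np.mean(x) - np.mean(y),
    permutation_type="independent",
    n_resamples=10_000,
    alternative="two-sided",
)
print(f"p = {result.pvalue:.4f}, A12 = {a12(new_fuzzer, baseline):.2f}")
```

The effect size complements the p-value: it quantifies how often the new fuzzer actually outperforms the baseline across trials, which is what a claimed improvement ultimately rests on.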
Conclusion
This work is instrumental in highlighting the gaps that exist in the evaluation practices of contemporary fuzzing research and delineates a clear pathway towards more rigorous and reproducible methodologies. These recommendations, if diligently followed, have the potential to significantly bolster the reliability, impact, and scientific contribution of future fuzzing studies. As fuzzing continues to be a pivotal tool in software security, advancing its evaluation practices is vital for ensuring progress and fostering innovation in both industry and academia.