Analyzing Fuzzing Evaluation Practices: A Methodological Study
The paper "SoK: Prudent Evaluation Practices for Fuzzing" offers a comprehensive analysis of the evaluation practices adopted in fuzzing research papers over the span of six years (2018-2023) in leading computer security and software engineering venues. The research critically examines the adherence of these studies to methodological best practices, mainly benchmarking them against guidelines provided by Klees et al., and further proposes updated recommendations for evaluating fuzzing methodologies.
Overview of Research and Findings
Fuzzing, widely acknowledged for its efficacy in discovering software bugs, relies fundamentally on randomness, which poses particular challenges to the reproducibility and validity of experimental results. The authors survey 150 fuzzing papers and find significant deficiencies in how their evaluations are conducted. Gaps appear in statistical testing, evaluation metrics, seed selection, and fairness of resource allocation, all of which are crucial for reproducible and trustworthy results. In addition, the authors perform artifact evaluation for a subset of eight papers to assess how practical and reproducible the published claims are.
A major observation is the inconsistent use, or outright neglect, of robust statistical methods in reported evaluations: 63 papers used no statistical tests to back their claims. Only 37 of the studies employed the Mann-Whitney U-test, and 15 of those did so with too few trial repetitions, calling the robustness of their findings into question. This lack of statistical rigor directly undermines the reliability of the reported results and of the claimed improvements over existing techniques.
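To make this concrete, the sketch below shows the kind of test such evaluations are expected to report: a Mann-Whitney U-test over the final coverage of repeated trials of two fuzzers, using SciPy. The coverage numbers, trial count, and fuzzer names are hypothetical placeholders for illustration, not data from the paper.

```python
# Minimal sketch: comparing per-trial edge coverage of two fuzzers with a
# Mann-Whitney U test. All values below are made-up placeholders; a real
# evaluation would use at least 10 independent trials per fuzzer under
# identical time and resource budgets.
from scipy.stats import mannwhitneyu

# Final edge coverage per trial (hypothetical, 10 trials each).
coverage_fuzzer_a = [1412, 1398, 1450, 1377, 1421, 1403, 1436, 1390, 1415, 1428]
coverage_fuzzer_b = [1365, 1352, 1401, 1340, 1389, 1371, 1358, 1394, 1346, 1380]

# Two-sided test: do the coverage distributions differ significantly?
statistic, p_value = mannwhitneyu(coverage_fuzzer_a, coverage_fuzzer_b,
                                  alternative="two-sided")
print(f"Mann-Whitney U = {statistic}, p = {p_value:.4f}")
```

Reporting the p-value from such a test, together with the number of trials, lets readers judge whether an observed coverage difference could plausibly be due to fuzzing's inherent randomness.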
Moreover, the paper highlights a disproportionate focus on certain fuzzers, such as AFL and its derivatives, at the expense of state-of-the-art alternatives. This skew in baseline selection limits the generalizability of the results and risks overlooking the strengths of the diverse fuzzing approaches available in contemporary practice.
Another salient issue is the practice surrounding CVE claims. The analysis shows that, of the many CVEs reported across the surveyed papers, a significant number could not be verified or were disputed, raising concerns about the pressure to demonstrate real-world impact through such metrics.
Recommendations and Best Practices
To address the identified shortcomings, the paper provides an updated set of guidelines for future fuzzing research:
- Reproducibility and Artifact Sharing: Authors should ensure their research artifacts, including code and experiment configurations, are openly accessible and accompanied by thorough documentation. Participation in artifact evaluation should be encouraged to improve transparency and reproducibility.
- Benchmarking and Target Selection: The paper advises using well-recognized benchmarks and a representative set of evaluation targets that align with the specifics of the technique under assessment.
- Fair Comparison and Seed Selection: It is vital to compare against relevant state-of-the-art fuzzers and employ a transparent and equitable selection of seed sets. Uninformed seeds or multiple seed sets should be used to validate claims effectively.
- Metrics and Statistical Analysis: Fuzzing studies should employ established evaluation metrics and apply statistical tests, such as bootstrap or permutation tests, systematically and with sufficient trial repetitions to substantiate any performance claims. Effect sizes should be reported alongside significance tests (a brief sketch of such an analysis follows this list).
- Documenting Threats to Validity: Explicit attention should be given to articulating possible threats to the validity of the research findings and discussing mitigation strategies within the documentation.
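As a companion to the metrics recommendation above, here is a minimal sketch of one way to pair a significance test with an effect size: a permutation test on the difference in mean per-trial results plus the Vargha-Delaney A12 statistic. The a12 helper, the data, and the choice of test statistic are illustrative assumptions rather than prescriptions from the paper, and scipy.stats.permutation_test requires SciPy 1.9 or newer.

```python
import numpy as np
from scipy.stats import permutation_test

def a12(sample_x, sample_y):
    """Vargha-Delaney A12: probability that a random draw from X beats one from Y."""
    x = np.asarray(sample_x)[:, None]
    y = np.asarray(sample_y)[None, :]
    greater = (x > y).sum()
    ties = (x == y).sum()
    return (greater + 0.5 * ties) / (len(sample_x) * len(sample_y))

# Hypothetical per-trial results (e.g., final edge coverage), 10 trials each.
new_fuzzer = [1412, 1398, 1450, 1377, 1421, 1403, 1436, 1390, 1415, 1428]
baseline   = [1365, 1352, 1401, 1340, 1389, 1371, 1358, 1394, 1346, 1380]

# Permutation test on the difference in means (two-sided).
result = permutation_test(
    (new_fuzzer, baseline),
    statistic=lambda x, y: np.mean(x) - np.mean(y),
    permutation_type="independent",
    n_resamples=10_000,
    alternative="two-sided",
)
print(f"p = {result.pvalue:.4f}, A12 = {a12(new_fuzzer, baseline):.2f}")
```

The effect size complements the p-value: it quantifies how often the new fuzzer actually outperforms the baseline across trials, which is what a claimed improvement ultimately rests on.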
Conclusion
This work is instrumental in highlighting the gaps that exist in the evaluation practices of contemporary fuzzing research and delineates a clear pathway towards more rigorous and reproducible methodologies. These recommendations, if diligently followed, have the potential to significantly bolster the reliability, impact, and scientific contribution of future fuzzing studies. As fuzzing continues to be a pivotal tool in software security, advancing its evaluation practices is vital for ensuring progress and fostering innovation in both industry and academia.