Evaluating Fuzz Testing: A Critical Analysis
The paper "Evaluating Fuzz Testing," presented during the 2018 ACM SIGSAC Conference on Computer and Communications Security, critically examines the experimental methodologies used in recent fuzz testing research. Despite the acknowledged efficacy of fuzz testing in identifying security-critical bugs, the authors highlight significant inconsistencies in evaluating new fuzzing techniques. This essay will provide an expert review of the paper, exploring its findings, implications, and potential directions for future research.
Overview of Fuzz Testing
Fuzz testing is a dynamic testing technique that repeatedly runs a target program on randomly generated or mutated inputs in order to expose software vulnerabilities. The methodology has evolved to include black-box, grey-box, and white-box fuzzing, each with distinct strategies for generating inputs and for exploiting coverage or program-analysis feedback. Despite its conceptual simplicity, fuzzing has been instrumental in identifying numerous bugs across diverse software systems.
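To make the core loop concrete, the following is a minimal sketch of a black-box mutational fuzzer in Python. It is not the paper's tooling: the target binary, the seed file, the mutation rate, and crash detection via signal exit codes are all illustrative assumptions.

```python
import random
import subprocess
import tempfile

def mutate(data: bytes, rate: float = 0.01) -> bytes:
    """Flip a random subset of bytes in the seed input."""
    out = bytearray(data)
    for i in range(len(out)):
        if random.random() < rate:
            out[i] = random.randrange(256)
    return bytes(out)

def fuzz_once(target: str, seed: bytes) -> bool:
    """Run the target on one mutated input and report whether it crashed."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(mutate(seed))
        f.flush()
        proc = subprocess.run([target, f.name],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    # On POSIX, a negative return code means the process died on a signal
    # (e.g., SIGSEGV), which we treat as a crash.
    return proc.returncode < 0

if __name__ == "__main__":
    seed = open("seed.bin", "rb").read()      # hypothetical seed input
    crashes = sum(fuzz_once("./target", seed) for _ in range(1000))
    print(f"crashes observed: {crashes}")
```

Coverage-guided (grey-box) fuzzers such as AFL extend this loop by keeping mutated inputs that reach new code, which is what makes them far more effective than purely random generation.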
Issues in Current Fuzzing Evaluations
The paper identifies key deficiencies in existing fuzzing evaluations through a detailed survey of 32 recent studies. These deficiencies include:
- Insufficient Trials: Many studies report a single run, so results can be distorted by the inherent randomness of fuzzing. Even when multiple trials are performed, the variance across them is rarely analyzed with statistical methods.
- Lack of Ground Truth: Evaluations frequently rely on heuristics such as AFL's coverage-based crash de-duplication or stack hashing to estimate how many distinct bugs were found, which can lead to substantial overcounting and, in some cases, missed bugs (a small stack-hashing sketch follows this list).
- Inadequate Timeout Settings: Short timeouts are common, potentially overlooking the varied bug-finding performance of fuzzers over extended periods.
- Seed Selection Ambiguities: The choice and documentation of seed inputs often lack clarity, despite their significant impact on fuzzing outcomes.
- Inconsistent Benchmark Programs: There is no widely adopted benchmark suite, so papers evaluate on different target programs, which complicates cross-paper comparisons and leads to varied conclusions about new techniques' effectiveness.
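To illustrate the ground-truth problem, the sketch below reproduces the stack-hashing heuristic in miniature: crashes are bucketed by a hash of their top stack frames, which can both split one bug across several buckets and merge distinct bugs into one. The call stacks, bug labels, and hash depth are invented for illustration and are not taken from the paper.

```python
import hashlib
from collections import defaultdict

def stack_hash(frames, depth=3):
    """Bucket a crash by a hash of its top `depth` stack frames."""
    return hashlib.sha1("|".join(frames[:depth]).encode()).hexdigest()

# Invented crashing stacks labelled with their (normally unknown) root cause:
# bug_A is reached through two different call paths, while bug_B happens to
# share its top frames with one of bug_A's crashes.
crashes = [
    (["memcpy", "parse_chunk", "parse_file"],   "bug_A"),
    (["memcpy", "copy_field",  "parse_header"], "bug_A"),
    (["memcpy", "parse_chunk", "parse_file"],   "bug_B"),
]

buckets = defaultdict(set)   # hash -> true bugs landing in that bucket
spread  = defaultdict(set)   # true bug -> hashes it lands in
for frames, bug in crashes:
    h = stack_hash(frames)
    buckets[h].add(bug)
    spread[bug].add(h)

print("heuristic 'unique' bugs:", len(buckets))
print("merged (undercount):", [sorted(b) for b in buckets.values() if len(b) > 1])
print("split (overcount):", [b for b, hs in spread.items() if len(hs) > 1])
```

Here the heuristic reports two unique bugs, but it merges bug_A and bug_B into one bucket while also splitting bug_A across two buckets, which is exactly the kind of miscounting the authors measure against ground truth.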
Experimental Evaluation
The authors present an extensive experimental evaluation using AFL and AFLFast, illustrating how weak evaluation protocols can produce misleading conclusions. Their analysis reveals:
- Performance Variability: Fuzzing results are heavily influenced by randomness, so performance differences should be validated with statistical tests such as the Mann-Whitney U test (a worked example follows this list).
- Seed Dependency: Different seeds can dramatically alter the number of unique crashing inputs, suggesting that evaluations should explore multiple seed configurations.
- Timeout Considerations: Longer durations may reveal performance trends obscured in shorter evaluations, thus providing a more comprehensive picture of a fuzzer's capabilities.
- Heuristic Over-reliance: De-duplicating crashes with coverage profiles or stack hashes misjudges bug-finding effectiveness; measured against ground truth, these heuristics both inflate bug counts and miss real bugs.
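As a worked example of the statistical comparison recommended above, the sketch below compares per-trial unique-crash counts for two fuzzers using SciPy's Mann-Whitney U test, together with a Vargha-Delaney A12 effect size computed by hand. The trial counts are invented and do not come from the paper's experiments.

```python
from scipy.stats import mannwhitneyu

# Hypothetical unique-crash counts from 30 independent trials of each fuzzer
# on the same target, seed corpus, and time budget.
fuzzer_a = [12, 9, 15, 11, 10, 13, 12, 14, 8, 11, 12, 10, 13, 9, 11,
            12, 14, 10, 11, 13, 9, 12, 11, 10, 13, 12, 11, 14, 10, 12]
fuzzer_b = [14, 13, 17, 12, 15, 16, 13, 15, 12, 14, 16, 13, 15, 14, 12,
            17, 15, 13, 14, 16, 13, 15, 14, 12, 16, 15, 14, 13, 15, 14]

# Two-sided Mann-Whitney U test: is the difference between the two
# distributions of trial outcomes statistically significant?
stat, p = mannwhitneyu(fuzzer_a, fuzzer_b, alternative="two-sided")

# Vargha-Delaney A12: the probability that a randomly chosen fuzzer_b trial
# beats a randomly chosen fuzzer_a trial (0.5 means no difference).
wins = sum(b > a for a in fuzzer_a for b in fuzzer_b)
ties = sum(b == a for a in fuzzer_a for b in fuzzer_b)
a12 = (wins + 0.5 * ties) / (len(fuzzer_a) * len(fuzzer_b))

print(f"U = {stat:.1f}, p = {p:.4f}, A12 = {a12:.2f}")
```

Reporting a p-value alongside an effect size, rather than a single mean or best-case number, is closer to the kind of presentation the authors argue for.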
Future Directions and Recommendations
The paper proposes recommendations to enhance the rigor of fuzz testing evaluations:
- Statistical Rigor: Employ multiple trials and appropriate statistical analyses to ensure reliability in performance comparisons.
- Ground Truth Utilization: Prioritize evaluations on programs with known bugs, such as the DARPA Cyber Grand Challenge (CGC) corpus or LAVA-M, so that crashing inputs can be mapped to distinct bugs (a triage sketch for LAVA-M follows this list).
- Comprehensive Benchmark Suites: Develop and adopt robust benchmark suites that reflect a wide range of program characteristics, fostering more generalizable findings.
- Seed Exploration: Clearly document and vary seed inputs, including an empty-seed configuration, since fuzzers interact differently with different initial corpora (a sketch of a multi-trial seed and timeout driver also follows this list).
- Extended Timeouts: Consider longer evaluation times to fully explore fuzzers' capabilities in different scenarios.
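To illustrate ground-truth counting, the sketch below re-runs a fuzzer's crashing inputs against a LAVA-M target and tallies the injected bug IDs that the program reports when a bug fires. The target path, crash directory, base64 -d invocation, and the exact wording of the triggered-bug message are assumptions that may differ across LAVA-M builds.

```python
import re
import subprocess
from collections import Counter
from pathlib import Path

TARGET = "./base64"                    # hypothetical LAVA-M target binary
CRASH_DIR = Path("findings/crashes")   # crashing inputs saved by the fuzzer

# LAVA-injected bugs announce themselves when triggered; the exact message
# format is assumed here and should be checked against the build in use.
BUG_MSG = re.compile(rb"Successfully triggered bug (\d+)")

triggered = Counter()
for crash in CRASH_DIR.iterdir():
    if not crash.is_file() or crash.name.startswith("README"):
        continue  # skip AFL's README and any stray directories
    proc = subprocess.run([TARGET, "-d", str(crash)], capture_output=True)
    m = BUG_MSG.search(proc.stdout + proc.stderr)
    if m:
        triggered[int(m.group(1).decode())] += 1

print(f"crashing inputs triaged: {sum(triggered.values())}")
print(f"distinct ground-truth bugs triggered: {len(triggered)}")
```

Comparing the number of crashing inputs with the number of distinct injected bugs they trigger makes the overcounting problem visible without relying on any de-duplication heuristic.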
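Similarly, the following is a rough sketch of how trials, seed configurations, and time budgets might be scripted together around afl-fuzz. The binary path, seed directories, 30-trial count, and 24-hour budget are illustrative choices; a real campaign would run trials in parallel and also handle CPU pinning, AFL environment settings, and result collection.

```python
import pathlib
import subprocess

TARGET = "./target"          # hypothetical instrumented binary
TRIALS = 30                  # independent trials per configuration
DURATION = 24 * 60 * 60      # per-trial time budget in seconds (24 hours)

# Each configuration pairs a label with a seed directory; "empty" points at a
# directory holding a single (near-)empty file, the configuration the paper
# argues should be reported alongside more realistic corpora.
SEED_CONFIGS = {
    "empty":    "seeds/empty",
    "sampled":  "seeds/sampled",
    "handmade": "seeds/handmade",
}

for label, seed_dir in SEED_CONFIGS.items():
    for trial in range(TRIALS):
        out_dir = pathlib.Path("findings") / label / f"trial-{trial}"
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = ["afl-fuzz", "-i", seed_dir, "-o", str(out_dir),
               "--", TARGET, "@@"]
        try:
            # afl-fuzz runs until killed; the timeout bounds each trial.
            # Trials are shown sequentially here for clarity only.
            subprocess.run(cmd, timeout=DURATION)
        except subprocess.TimeoutExpired:
            pass  # the trial simply used up its time budget
```

Recording the label, trial number, seed directory, and duration in the output path makes it straightforward to document the exact configuration in a paper, which is one of the reporting gaps the survey highlights.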
Implications and Conclusion
Evaluating fuzz testing is crucial for developing more effective vulnerability detection methods. The paper underscores the need for methodological rigor in establishing the credibility of new fuzzing techniques. By adhering to the recommendations provided, future research can produce more reliable and comparable results, thereby advancing the understanding and application of fuzz testing in software security.
As fuzz testing continues to evolve, addressing these methodological concerns will ensure that contributions are scientifically robust, paving the way for innovations that solidify fuzz testing as a fundamental component of secure software development practices.