Leveraging LLMs for Fuzzing Deep Learning Libraries
The paper "LLMs are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via LLMs" presents a novel approach, TitanFuzz, designed to enhance the fuzz testing of deep learning (DL) libraries using LLMs as generative engines. This approach addresses the significant challenge of finding bugs in DL libraries such as TensorFlow and PyTorch, which are critical components in modern deep learning systems due to their dense and intricate APIs.
The authors highlight the limitations of traditional fuzzing techniques, emphasizing that generated DL programs must satisfy both the syntax of the input language and the semantic constraints of DL APIs. TitanFuzz is proposed as the first method to leverage LLMs to automatically generate input programs for fuzzing DL libraries, exploiting the models' ability to understand the syntax and semantics of languages such as Python and to produce complex, valid DL API sequences.
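For a concrete sense of what such a program looks like, the following PyTorch snippet is an illustrative example (not taken from the paper) of a short, valid API sequence of the kind TitanFuzz expects an LLM to generate zero-shot:

```python
# Illustrative example (not from the paper): a short PyTorch program that
# chains several APIs while respecting shape and dtype constraints -- the
# kind of syntactically and semantically valid test case an LLM can produce.
import torch

x = torch.randn(4, 3, 8, 8)                               # NCHW input tensor
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
y = conv(x)                                               # shape: (4, 16, 8, 8)
y = torch.nn.functional.relu(y)
y = torch.nn.functional.max_pool2d(y, kernel_size=2)      # shape: (4, 16, 4, 4)
out = y.flatten(start_dim=1).sum()
out.backward()                                            # also exercises autograd
```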
Key Methodology and Results
TitanFuzz uses a generative LLM (Codex) and an infilling LLM (InCoder) for two complementary tasks: generating initial seed programs and evolving those seeds through mutation. The generative model is guided by carefully crafted prompts to produce high-quality seed test cases, while the infilling model mutates them by masking and regenerating parts of each program, exploring a vast input space with the goal of maximizing coverage and exposing bugs that arise from complex interactions within API sequences.
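The following is a minimal sketch of this generate-then-mutate loop, not the paper's implementation: `generate_seed` and `infill` are hypothetical stubs standing in for calls to a Codex-style generative model and an InCoder-style infilling model, and seed selection is random rather than fitness-guided as in the paper.

```python
# Minimal sketch of a two-stage LLM fuzzing pipeline: a generative model
# produces seed programs from a prompt, and an infilling model mutates seeds
# by masking a line and regenerating it. The LLM calls are stubbed out so the
# loop structure can run end to end.
import random

MASK = "<MASK>"  # placeholder the infilling model is asked to fill

def generate_seed(prompt: str) -> str:
    # Stand-in for a generative code LLM (e.g., Codex) completing the prompt.
    return prompt + "x = torch.randn(2, 3)\ny = torch.nn.ReLU()(x)\n"

def infill(masked_program: str) -> str:
    # Stand-in for an infilling code LLM (e.g., InCoder) filling the mask.
    return masked_program.replace(MASK, "y = torch.sigmoid(x)")

def mutate(program: str) -> str:
    # The paper applies several masking operators (arguments, prefixes,
    # suffixes, method calls); this sketch simply masks one random line.
    lines = program.splitlines()
    lines[random.randrange(len(lines))] = MASK
    return infill("\n".join(lines))

def fuzz_api(prompt: str, iterations: int = 100) -> list[str]:
    # Evolutionary loop: the paper selects seeds with a fitness function;
    # random selection keeps this sketch minimal.
    corpus = [generate_seed(prompt)]
    for _ in range(iterations):
        corpus.append(mutate(random.choice(corpus)))
    return corpus

if __name__ == "__main__":
    programs = fuzz_api('"""Test torch.nn.ReLU on a random tensor."""\nimport torch\n')
    print(f"generated {len(programs)} candidate test programs")
```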
The experimental evaluation on PyTorch and TensorFlow demonstrates that TitanFuzz achieves substantially higher code and API coverage than state-of-the-art fuzzers: specifically, 30.38% and 50.84% higher code coverage than the best existing fuzzers on TensorFlow and PyTorch, respectively. These gains are obtained within a reasonable time budget, given the cost of querying LLMs inside an evolutionary fuzzing loop.
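Code coverage for such Python-level test programs can be measured with standard tooling; a minimal sketch using the coverage.py package is shown below (the paper's exact measurement harness is not reproduced here, and the embedded test program is illustrative).

```python
# Minimal sketch: measure library line coverage while executing one generated
# test program, using the coverage.py package. Illustrative only.
import coverage

TEST_PROGRAM = """
import torch
x = torch.randn(4, 4)
y = torch.linalg.inv(x @ x.T + torch.eye(4))
"""

cov = coverage.Coverage(source=["torch"])  # track only the library under test
cov.start()
exec(TEST_PROGRAM, {})                     # run one generated program
cov.stop()
print(f"torch line coverage: {cov.report():.2f}%")
```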
Strong Numerical Outcomes and Implications
TitanFuzz uncovers 65 bugs in the two libraries, 41 of which were previously unknown. The paper reports notable findings that other techniques could not detect, including bugs triggered only by API sequences that interact non-trivially with their input data. These results indicate that modern LLMs can supply a diverse range of valid inputs because they implicitly learn the constraints on, and relationships between, DL APIs from the vast code corpora they are trained on.
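Wrong-computation bugs of this kind are typically surfaced by differential testing, for example comparing CPU and GPU results of the same generated program. A minimal sketch of such an oracle for PyTorch, assuming a CUDA device is available, might look like the following; the tested API and tolerances are illustrative, not taken from the paper.

```python
# Minimal sketch of a CPU-vs-GPU differential oracle for PyTorch, assuming a
# CUDA device is available; tolerances and the tested API are illustrative.
import torch

def run_on(device: str) -> torch.Tensor:
    torch.manual_seed(0)
    x = torch.randn(8, 8)                   # build the input on CPU so both
    return torch.linalg.matrix_power(x.to(device), 3).cpu()  # runs see identical data

def check_consistency() -> bool:
    cpu_out = run_on("cpu")
    gpu_out = run_on("cuda")
    # Divergence beyond tolerance (or a crash on either backend, caught by a
    # surrounding harness) flags the generated program as a potential bug.
    return torch.allclose(cpu_out, gpu_out, rtol=1e-4, atol=1e-5)

if __name__ == "__main__":
    print("consistent" if check_consistency() else "potential bug: CPU/GPU mismatch")
```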
The implications of this research are manifold. Practically, the successful use of LLMs in this domain suggests a potential to apply this approach to other complex software systems beyond DL libraries, such as compilers, database systems, and SMT solvers. Theoretically, it reinforces the potential of LLMs in software testing and verification tasks, demonstrating their strength in performing tasks traditionally thought to require explicit constraint-solving or complex program analysis.
Conclusion and Future Directions
The authors present a compelling case for the utility of LLMs in fuzz testing, opening new avenues for research in automated software testing and model-based testing frameworks. Future work might refine prompt engineering, improve the efficiency of LLM-based generation, and extend these techniques to other domains where traditional fuzzing struggles. The application of LLMs in this context not only strengthens DL library testing but also signals a broader shift toward data-driven approaches in software quality assurance.