Leveraging LLMs for Fuzzing Deep Learning Libraries
The paper "LLMs are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via LLMs" presents a novel approach, TitanFuzz, designed to enhance the fuzz testing of deep learning (DL) libraries using LLMs as generative engines. This approach addresses the significant challenge of finding bugs in DL libraries such as TensorFlow and PyTorch, which are critical components in modern deep learning systems due to their dense and intricate APIs.
The authors highlight the limitations of traditional fuzzing techniques, emphasizing that generated DL programs must satisfy both the syntax of the input language and the semantic constraints of DL APIs. TitanFuzz is proposed as the first method to leverage LLMs to automatically generate input programs for fuzzing DL libraries, exploiting the models' ability to understand the syntax and semantics of languages such as Python and to produce complex, valid DL API sequences.
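For a concrete sense of what such a program looks like, the following PyTorch snippet is an illustrative example (not taken from the paper) of a short, valid API sequence of the kind TitanFuzz expects an LLM to generate zero-shot:

```python
# Illustrative example (not from the paper): a short PyTorch program that
# chains several APIs while respecting shape and dtype constraints -- the
# kind of syntactically and semantically valid test case an LLM can produce.
import torch

x = torch.randn(4, 3, 8, 8)                               # NCHW input tensor
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
y = conv(x)                                               # shape: (4, 16, 8, 8)
y = torch.nn.functional.relu(y)
y = torch.nn.functional.max_pool2d(y, kernel_size=2)      # shape: (4, 16, 4, 4)
out = y.flatten(start_dim=1).sum()
out.backward()                                            # also exercises autograd
```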
Key Methodology and Results
TitanFuzz uses a generative LLM (Codex) and an infilling LLM (InCoder) for two complementary tasks: generating initial seed programs and evolving those seeds through mutation. The generative model is guided by carefully crafted prompts to produce high-quality seed test cases, while the infilling model mutates them by masking and regenerating parts of each program, exploring a vast input space with the goal of maximizing coverage and exposing bugs that arise from complex interactions within API sequences.
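The following is a minimal sketch of this generate-then-mutate loop, not the paper's implementation: `generate_seed` and `infill` are hypothetical stubs standing in for calls to a Codex-style generative model and an InCoder-style infilling model, and seed selection is random rather than fitness-guided as in the paper.

```python
# Minimal sketch of a two-stage LLM fuzzing pipeline: a generative model
# produces seed programs from a prompt, and an infilling model mutates seeds
# by masking a line and regenerating it. The LLM calls are stubbed out so the
# loop structure can run end to end.
import random

MASK = "<MASK>"  # placeholder the infilling model is asked to fill

def generate_seed(prompt: str) -> str:
    # Stand-in for a generative code LLM (e.g., Codex) completing the prompt.
    return prompt + "x = torch.randn(2, 3)\ny = torch.nn.ReLU()(x)\n"

def infill(masked_program: str) -> str:
    # Stand-in for an infilling code LLM (e.g., InCoder) filling the mask.
    return masked_program.replace(MASK, "y = torch.sigmoid(x)")

def mutate(program: str) -> str:
    # The paper applies several masking operators (arguments, prefixes,
    # suffixes, method calls); this sketch simply masks one random line.
    lines = program.splitlines()
    lines[random.randrange(len(lines))] = MASK
    return infill("\n".join(lines))

def fuzz_api(prompt: str, iterations: int = 100) -> list[str]:
    # Evolutionary loop: the paper selects seeds with a fitness function;
    # random selection keeps this sketch minimal.
    corpus = [generate_seed(prompt)]
    for _ in range(iterations):
        corpus.append(mutate(random.choice(corpus)))
    return corpus

if __name__ == "__main__":
    programs = fuzz_api('"""Test torch.nn.ReLU on a random tensor."""\nimport torch\n')
    print(f"generated {len(programs)} candidate test programs")
```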
The experimental evaluation on PyTorch and TensorFlow demonstrates that TitanFuzz achieves substantially higher code and API coverage than state-of-the-art fuzzers: specifically, 30.38% and 50.84% higher code coverage than the best existing fuzzers on TensorFlow and PyTorch, respectively. These gains are obtained within a reasonable time budget, given the cost of querying LLMs inside an evolutionary fuzzing loop.
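Code coverage for such Python-level test programs can be measured with standard tooling; a minimal sketch using the coverage.py package is shown below (the paper's exact measurement harness is not reproduced here, and the embedded test program is illustrative).

```python
# Minimal sketch: measure library line coverage while executing one generated
# test program, using the coverage.py package. Illustrative only.
import coverage

TEST_PROGRAM = """
import torch
x = torch.randn(4, 4)
y = torch.linalg.inv(x @ x.T + torch.eye(4))
"""

cov = coverage.Coverage(source=["torch"])  # track only the library under test
cov.start()
exec(TEST_PROGRAM, {})                     # run one generated program
cov.stop()
print(f"torch line coverage: {cov.report():.2f}%")
```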
Strong Numerical Outcomes and Implications
TitanFuzz uncovers 65 bugs in the two libraries, 41 of which were previously unknown. The paper reports notable findings that other techniques could not detect, including bugs triggered only by API sequences that interact non-trivially with their input data. These results indicate that modern LLMs can supply a diverse range of valid inputs because they implicitly learn the constraints on, and relationships between, DL APIs from the vast code corpora they are trained on.
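Wrong-computation bugs of this kind are typically surfaced by differential testing, for example comparing CPU and GPU results of the same generated program. A minimal sketch of such an oracle for PyTorch, assuming a CUDA device is available, might look like the following; the tested API and tolerances are illustrative, not taken from the paper.

```python
# Minimal sketch of a CPU-vs-GPU differential oracle for PyTorch, assuming a
# CUDA device is available; tolerances and the tested API are illustrative.
import torch

def run_on(device: str) -> torch.Tensor:
    torch.manual_seed(0)
    x = torch.randn(8, 8)                   # build the input on CPU so both
    return torch.linalg.matrix_power(x.to(device), 3).cpu()  # runs see identical data

def check_consistency() -> bool:
    cpu_out = run_on("cpu")
    gpu_out = run_on("cuda")
    # Divergence beyond tolerance (or a crash on either backend, caught by a
    # surrounding harness) flags the generated program as a potential bug.
    return torch.allclose(cpu_out, gpu_out, rtol=1e-4, atol=1e-5)

if __name__ == "__main__":
    print("consistent" if check_consistency() else "potential bug: CPU/GPU mismatch")
```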
The implications of this research are manifold. Practically, the successful use of LLMs in this domain suggests a potential to apply this approach to other complex software systems beyond DL libraries, such as compilers, database systems, and SMT solvers. Theoretically, it reinforces the potential of LLMs in software testing and verification tasks, demonstrating their strength in performing tasks traditionally thought to require explicit constraint-solving or complex program analysis.
Conclusion and Future Directions
The authors present a compelling case for the utility of LLMs in fuzz testing, opening new avenues for research in automated software testing and model-based testing frameworks. Future work might refine prompt engineering, improve the efficiency of LLM-based generation, and extend these techniques to other domains where traditional fuzzing struggles. The application of LLMs in this context not only strengthens DL library testing but also signals a broader shift toward data-driven approaches in software quality assurance.