
FuzzCoder: Byte-level Fuzzing Test via Large Language Model (2409.01944v1)

Published 3 Sep 2024 in cs.CL

Abstract: Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem, and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned LLMs (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework that leverages code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as sequence-to-sequence modeling, where an LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on a created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from a heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gains significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.

Overview of "FuzzCoder: Byte-level Fuzzing Test via LLM"

"FuzzCoder: Byte-level Fuzzing Test via LLM" introduces an innovative approach to dynamic program analysis, specifically focusing on fuzzing—a method used to uncover vulnerabilities by subjecting software to malformed inputs. The paper proposes a framework that leverages LLMs to enhance the efficiency and effectiveness of byte-level fuzzing through an intelligent mutation process.

Summary and Key Contributions

The paper presents FuzzCoder, a tool designed to improve the process of input mutation, which is crucial for effective fuzzing. FuzzCoder's framework is underpinned by LLMs fine-tuned on a dataset named Fuzz-Instruct. This dataset is carefully curated from successful fuzzing histories, aiding the LLM in learning patterns and strategies for input mutations that are more likely to expose software vulnerabilities.

Three main contributions are highlighted:

  1. Sequence-to-Sequence Modeling for Input Mutation:
    • The mutation process is framed as a sequence-to-sequence task, where the LLM receives a sequence of bytes and outputs a mutated byte sequence. This approach allows the model to predict both the location and type of mutations.
  2. Fuzz-Instruct Dataset Construction:
    • A comprehensive dataset is created by collecting mutation instances from heuristic fuzzing tools. This dataset is used to fine-tune the LLMs, enhancing their ability to predict effective mutations for various input formats, such as ELF, JPG, MP3, and XML.
  3. Fuzz-Bench Evaluation Framework:
    • FuzzCoder is evaluated using a newly constructed benchmark—Fuzz-Bench—consisting of eight programs. The results indicate significant improvements in metrics such as effective proportion of mutation (EPM) and number of crashes (NC) compared to traditional fuzzing methods.

Methodology

Mutation Process as Sequence-to-Sequence Modeling

The paper positions the mutation process within the context of sequence-to-sequence modeling. This involves converting the data into byte sequences and leveraging LLMs to predict where and how to mutate these bytes to maximize the likelihood of triggering software vulnerabilities. The LLMs are fine-tuned on the Fuzz-Instruct dataset, allowing them to understand and generate byte-level data effectively.
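The sequence-to-sequence framing can be sketched concretely: serialize the raw bytes of an input file into a token sequence for the model, then parse the model's prediction of where and how to mutate back into a byte sequence. The prompt and response formats below are hypothetical illustrations (the paper's exact serialization is not specified in this summary), as are the function names:

```python
# Sketch: framing byte-level mutation as a sequence-to-sequence task.
# The prompt/response formats here are hypothetical, not the paper's own.

def bytes_to_prompt(data: bytes) -> str:
    """Serialize raw bytes as space-separated hex tokens for the model."""
    return " ".join(f"{b:02x}" for b in data)

def apply_model_response(data: bytes, response: str) -> bytes:
    """Parse a hypothetical (position, strategy, payload) response and
    apply it as a mutation to the original byte sequence.

    Assumed response format: "pos=<int> op=<overwrite|insert> val=<hex>"
    """
    fields = dict(tok.split("=", 1) for tok in response.split())
    pos = int(fields["pos"])
    payload = bytes.fromhex(fields["val"])
    out = bytearray(data)
    if fields["op"] == "overwrite":
        out[pos:pos + len(payload)] = payload
    elif fields["op"] == "insert":
        out[pos:pos] = payload
    return bytes(out)

# A seed ELF header prefix, mutated at byte offset 4:
seed = b"\x7fELF\x01\x01"
prompt = bytes_to_prompt(seed)
mutated = apply_model_response(seed, "pos=4 op=overwrite val=ff")
```

In this framing, the model's output jointly encodes the mutation location (`pos`) and the mutation strategy (`op`), which is what allows a fine-tuned LLM to bias mutations toward byte regions that historically triggered crashes.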

Fuzz-Instruct Dataset

Fuzz-Instruct is a collected corpus formed from the successful mutation instances recorded from heuristic fuzzing tools like AFL (American Fuzzy Lop). Each entry in this dataset consists of an original input sequence paired with its successfully mutated counterpart, providing valuable training data for the LLMs.
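A single Fuzz-Instruct record pairing an original input with its successful mutation might look like the following. The field names and instruction wording are illustrative assumptions; the paper's exact schema is not given in this summary:

```python
# Sketch: one hypothetical Fuzz-Instruct training record, built from a
# successful AFL mutation (an input whose mutated form triggered a crash
# or new behavior). Field names and prompt text are illustrative only.

def make_fuzz_instruct_record(original: bytes, mutated: bytes, fmt: str) -> dict:
    """Build an instruction-tuning example from one successful mutation."""
    return {
        "instruction": (
            f"You are given a {fmt} byte sequence. "
            "Output a mutated byte sequence likely to trigger abnormal behavior."
        ),
        "input": original.hex(" "),   # original bytes as hex tokens
        "output": mutated.hex(" "),   # successfully mutated counterpart
    }

# A JPG magic-number prefix with one byte flipped by the fuzzer:
record = make_fuzz_instruct_record(b"\xff\xd8\xff\xe0", b"\xff\xd8\x00\xe0", "JPG")
```

Collecting many such records across formats (ELF, JPG, MP3, XML) gives the fine-tuned model format-specific priors about which byte positions and values tend to expose vulnerabilities.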

Evaluation with Fuzz-Bench

FuzzCoder was extensively evaluated using the Fuzz-Bench benchmark, which comprises programs such as NM_ELF, READ_ELF, OBJDUMP_ELF, and others. These programs accept a variety of input formats and represent different domains of software applications.

Experimental Results

  • Effective Proportion of Mutation (EPM): FuzzCoder outperformed baseline methods (AFL with heuristic and small models) in the EPM metric across all eight programs. Notably, the CodeQwen and DeepSeek-Coder models consistently achieved higher EPM.
  • Number of Crashes (NC): FuzzCoder demonstrated a higher number of crashes compared to baseline methods, indicating that it can uncover more vulnerabilities. For instance, in the READ_ELF benchmark, FuzzCoder with CodeQwen achieved nine crashes compared to zero crashes for the baselines.
  • Code Coverage: FuzzCoder showed superior performance in terms of line, branch, and function coverage in comparison to traditional methods, demonstrating its efficacy in exploring a broader range of execution paths within the target programs.
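The EPM metric above reduces to a simple ratio: the fraction of generated mutations that are "effective" (for example, reaching new coverage or crashing the target). The exact effectiveness criterion the paper uses is not detailed in this summary, so the sketch below only captures the ratio itself:

```python
# Sketch: effective proportion of mutation (EPM) as a ratio of effective
# mutations to total mutations attempted. What counts as "effective"
# (new coverage, crash, etc.) is an assumption left to the fuzzing harness.

def epm(effective_mutations: int, total_mutations: int) -> float:
    """Return the effective proportion of mutation; 0.0 when no mutations ran."""
    if total_mutations == 0:
        return 0.0
    return effective_mutations / total_mutations

# e.g. 120 effective mutations out of 4000 attempts:
rate = epm(120, 4000)  # 0.03
```

A higher EPM means the mutator wastes fewer executions on inputs the target handles normally, which is precisely where an LLM-guided mutator is claimed to beat uniform random mutation.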

Implications and Future Directions

The introduction of FuzzCoder has several implications for both practical and theoretical domains in fuzz testing:

  • Practical Implications:
    • By integrating LLMs into fuzzing workflows, FuzzCoder significantly enhances the ability to detect software vulnerabilities efficiently.
    • The framework can be adapted to various fuzzing tools and input formats, making it a versatile tool in the software testing arsenal.
  • Theoretical Implications:
    • The sequence-to-sequence modeling approach offers a novel perspective on the fuzzing process, potentially opening new avenues for research in dynamic program analysis and input mutation strategies.
    • The success of fine-tuning domain-specific LLMs on custom datasets like Fuzz-Instruct provides a blueprint for constructing more domain-focused models in other areas of software engineering.

Conclusion

This comprehensive paper encapsulates a significant advancement in fuzz testing methodologies by exploiting the power of LLMs. FuzzCoder proposes an effective and efficient framework for input mutation, demonstrated through rigorous evaluation on the Fuzz-Bench benchmark. This work charts a promising path for future developments in AI-driven software vulnerability detection, highlighting the potential of LLMs to revolutionize this critical field.

Authors (16)
  1. Liqun Yang
  2. Jian Yang
  3. Chaoren Wei
  4. Guanglin Niu
  5. Ge Zhang
  6. Yunli Wang
  7. Linzheng Chai
  8. Wanxu Xia
  9. Hongcheng Guo
  10. Shun Zhang
  11. Jiaheng Liu
  12. Yuwei Yin
  13. Junran Peng
  14. Jiaxin Ma
  15. Liang Sun
  16. Zhoujun Li