Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 77 tok/s

Gemini 2.5 Pro 54 tok/s Pro

GPT-5 Medium 29 tok/s Pro

GPT-5 High 26 tok/s Pro

GPT-4o 103 tok/s Pro

Kimi K2 175 tok/s Pro

GPT OSS 120B 454 tok/s Pro

Claude Sonnet 4.5 38 tok/s Pro

2000 character limit reached

BASFuzz: Towards Robustness Evaluation of LLM-based NLP Software via Automated Fuzz Testing (2509.17335v1)

Published 22 Sep 2025 in cs.SE

Abstract: Fuzzing has shown great success in evaluating the robustness of intelligent NLP software. As LLM-based NLP software is widely deployed in critical industries, existing methods still face two main challenges: 1 testing methods are insufficiently coupled with the behavioral patterns of LLM-based NLP software; 2 fuzzing capability for the testing scenario of natural language generation (NLG) generally degrades. To address these issues, we propose BASFuzz, an efficient Fuzz testing method tailored for LLM-based NLP software. BASFuzz targets complete test inputs composed of prompts and examples, and uses a text consistency metric to guide mutations of the fuzzing loop, aligning with the behavioral patterns of LLM-based NLP software. A Beam-Annealing Search algorithm, which integrates beam search and simulated annealing, is employed to design an efficient fuzzing loop. In addition, information entropy-based adaptive adjustment and an elitism strategy further enhance fuzzing capability. We evaluate BASFuzz on six datasets in representative scenarios of NLG and natural language understanding (NLU). Experimental results demonstrate that BASFuzz achieves a testing effectiveness of 90.335% while reducing the average time overhead by 2,163.852 seconds compared to the current best baseline, enabling more effective robustness evaluation prior to software deployment.