Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task (2310.06504v1)
Abstract: With their growing capabilities, large language models (LLMs) have achieved state-of-the-art results on a wide range of NLP tasks. However, their performance on commonly used benchmark datasets often fails to reflect their reliability and robustness when applied to real-world noisy data. To address these challenges, we propose a unified robustness evaluation framework based on the slot-filling task to systematically evaluate the dialogue understanding capability of LLMs under diverse input perturbation scenarios. Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbations and four types of mixed perturbations. Furthermore, we utilize a multi-level data augmentation method (character, word, and sentence levels) to construct a candidate data pool, and carefully design two automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how well various robustness methods for LLMs perform in real-world noisy scenarios. Our experiments demonstrate that current open-source LLMs generally achieve only limited robustness to input perturbations. Based on these experimental observations, we offer forward-looking suggestions to fuel research in this direction.
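The abstract's multi-level augmentation idea (character-, word-, and sentence-level noise feeding a candidate pool) can be illustrated with a minimal Python sketch. The paper's actual Noise-LLM construction is not shown here; every function name and noise operation below is an illustrative assumption, not the authors' implementation.

```python
import random

# Hypothetical sketch of multi-level input perturbation for slot filling.
# These perturbation operators are stand-ins for the paper's augmentation
# method, not the actual Noise-LLM construction code.

def char_level_perturb(text: str, rate: float = 0.1) -> str:
    """Character-level noise: randomly swap adjacent characters (typos)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 1  # skip the swapped character
        i += 1
    return "".join(chars)

def word_level_perturb(text: str, rate: float = 0.1) -> str:
    """Word-level noise: randomly duplicate words to simulate disfluency."""
    out = []
    for w in text.split():
        out.append(w)
        if random.random() < rate:
            out.append(w)  # e.g. "book a a flight"
    return " ".join(out)

def sentence_level_perturb(text: str) -> str:
    """Sentence-level noise: prepend a colloquial filler (a cheap stand-in
    for paraphrasing, which would normally use a rewriting model)."""
    fillers = ["um,", "well,", "so basically,"]
    return f"{random.choice(fillers)} {text}"

def build_candidate_pool(utterance: str, n: int = 5) -> list[str]:
    """Apply a random mix of perturbation levels to grow a candidate pool,
    covering both single and mixed perturbation settings."""
    levels = [char_level_perturb, word_level_perturb, sentence_level_perturb]
    pool = []
    for _ in range(n):
        noisy = utterance
        for fn in random.sample(levels, k=random.randint(1, len(levels))):
            noisy = fn(noisy)
        pool.append(noisy)
    return pool

if __name__ == "__main__":
    print(build_candidate_pool("book a flight from boston to denver"))
```

Sampling one perturbation function yields the single-perturbation setting, while chaining several approximates the mixed-perturbation setting described in the abstract.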
- Guanting Dong (46 papers)
- Jinxu Zhao (5 papers)
- Tingfeng Hui (10 papers)
- Daichi Guo (8 papers)
- Wenlong Wan (4 papers)
- Boqi Feng (1 paper)
- Yueyan Qiu (3 papers)
- Keqing He (47 papers)
- Zechen Wang (15 papers)
- Weiran Xu (58 papers)
- Zhuoma GongQue (7 papers)