This paper investigates the robustness of state-of-the-art neural language models (LMs) such as BERT, RoBERTa, XLNet, and ELMo to various input perturbations that simulate real-world noise (Moradi et al., 2021). The core argument is that standard benchmark evaluations often fail to capture how these models perform on slightly noisy or altered text, which is common in practical applications.
To evaluate robustness, the researchers designed and implemented a comprehensive set of character-level and word-level perturbation methods. These methods aim to mimic realistic noise scenarios rather than worst-case adversarial attacks; illustrative code sketches of both groups follow the lists below.
Character-Level Perturbations:
- Insertion: Randomly insert a character into a word.
- Deletion: Randomly delete a character from a word (not first or last).
- Replacement: Replace a character with an adjacent one on the keyboard.
- Swapping: Swap adjacent characters within a word.
- Repetition: Repeat a character within a word.
- Common Misspelled Words (CMW): Replace words with common misspellings from a Wikipedia list.
- Letter Case Changing (LCC): Randomly change the case of the first character or all characters in a word.
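To make the character-level operations concrete, here is a minimal Python sketch of four of them (Insertion, Deletion, Replacement, Swapping). The function names and the partial keyboard-adjacency map are illustrative assumptions for this summary, not the authors' released implementation:

```python
import random
import string

# Illustrative (partial) keyboard-adjacency map; an assumption for this sketch,
# not the mapping used in the paper.
KEYBOARD_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "i": "ujko", "o": "iklp", "n": "bhjm", "m": "njk",
}

def insert_char(word):
    """Insertion: insert a random letter at a random position in the word."""
    pos = random.randint(0, len(word))
    return word[:pos] + random.choice(string.ascii_lowercase) + word[pos:]

def delete_char(word):
    """Deletion: remove a random character, keeping the first and last ones."""
    if len(word) < 3:
        return word
    pos = random.randint(1, len(word) - 2)
    return word[:pos] + word[pos + 1:]

def replace_char(word):
    """Replacement: substitute one character with a keyboard-adjacent one."""
    candidates = [i for i, c in enumerate(word) if c.lower() in KEYBOARD_NEIGHBORS]
    if not candidates:
        return word
    pos = random.choice(candidates)
    return word[:pos] + random.choice(KEYBOARD_NEIGHBORS[word[pos].lower()]) + word[pos + 1:]

def swap_chars(word):
    """Swapping: transpose two adjacent characters."""
    if len(word) < 2:
        return word
    pos = random.randint(0, len(word) - 2)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
```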
Word-Level Perturbations:
- Deletion: Randomly remove a word from the text.
- Repetition: Randomly repeat a word.
- Replacement With Synonyms (RWS): Replace words with synonyms from WordNet.
- Negation: Add or remove negation from verbs.
- Singular/Plural Verbs (SPV): Swap singular/plural verb forms.
- Verb Tense (VT): Change verb tense (e.g., present to past).
- Word Order (WO): Randomly reorder a sequence of M consecutive words.
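A similar hedged sketch for a few of the word-level perturbations (Deletion, Repetition, RWS via WordNet, and Word Order over a window of M tokens). The token-list interface and helper names are assumptions made here for illustration only:

```python
import random
from nltk.corpus import wordnet  # requires: pip install nltk; nltk.download("wordnet")

def delete_word(tokens):
    """Deletion: drop one randomly chosen token."""
    if len(tokens) < 2:
        return tokens
    pos = random.randrange(len(tokens))
    return tokens[:pos] + tokens[pos + 1:]

def repeat_word(tokens):
    """Repetition: duplicate one randomly chosen token in place."""
    pos = random.randrange(len(tokens))
    return tokens[:pos + 1] + [tokens[pos]] + tokens[pos + 1:]

def replace_with_synonym(tokens):
    """RWS: replace one token with a WordNet synonym, if any exists."""
    positions = list(range(len(tokens)))
    random.shuffle(positions)
    for pos in positions:
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(tokens[pos])
            for lemma in synset.lemmas()
            if lemma.name().lower() != tokens[pos].lower()
        }
        if synonyms:
            perturbed = list(tokens)
            perturbed[pos] = random.choice(sorted(synonyms))
            return perturbed
    return tokens

def shuffle_word_order(tokens, m=3):
    """WO: randomly reorder a window of M consecutive tokens."""
    if len(tokens) < 2:
        return tokens
    m = min(m, len(tokens))
    start = random.randrange(len(tokens) - m + 1)
    window = tokens[start:start + m]
    random.shuffle(window)
    return tokens[:start] + window + tokens[start + m:]
```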
The paper applied these perturbations to the test sets of five diverse NLP tasks:
- Text Classification (TC): TREC dataset
- Sentiment Analysis (SA): Stanford Sentiment Treebank (SST)
- Named Entity Recognition (NER): CoNLL-2003 dataset
- Semantic Similarity (SS): STS benchmark
- Question Answering (QA): WikiQA dataset
The four LMs (BERT-LARGE, RoBERTa-LARGE, XLNet-LARGE, ELMo) were fine-tuned on the original training data for each task and then evaluated on both the original test sets and the perturbed versions. The number of perturbations applied per sample (PPS) was varied from 1 to 4 to study the effect of noise intensity.
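The evaluation protocol is straightforward to reproduce in outline. The sketch below assumes a generic `metric_fn(model, dataset)` scorer, a test set of (text, label) pairs, and word-level perturbation functions like those above; these names are placeholders, not the paper's actual evaluation code:

```python
import random

def perturb_sample(text, perturbation_fns, pps=1):
    """Apply `pps` randomly chosen word-level perturbations to one text sample."""
    tokens = text.split()
    for _ in range(pps):
        tokens = random.choice(perturbation_fns)(tokens)
    return " ".join(tokens)

def evaluate_robustness(model, test_set, perturbation_fns, metric_fn, max_pps=4):
    """Score a fine-tuned model on the clean test set and on PPS = 1..max_pps copies.

    `test_set` is assumed to be a list of (text, label) pairs and `metric_fn`
    a task-specific scorer (e.g. accuracy or F1); both are placeholders here.
    """
    scores = {0: metric_fn(model, test_set)}  # PPS = 0: original, unperturbed test set
    for pps in range(1, max_pps + 1):
        perturbed = [(perturb_sample(text, perturbation_fns, pps), label)
                     for text, label in test_set]
        scores[pps] = metric_fn(model, perturbed)
    return scores

# Example usage (hypothetical): compare degradation under word-level noise.
# scores = evaluate_robustness(model, test_set,
#                              [delete_word, repeat_word, replace_with_synonym],
#                              metric_fn=accuracy)
```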
Key Findings:
- Significant Performance Drop: All models showed a notable decrease in performance even with a single perturbation per sample (PPS=1). Performance degradation increased with higher PPS values.
- Sensitivity to Perturbation Type: Models were generally more sensitive to character-level perturbations than word-level ones across most tasks.
- Task-Dependent Robustness: Sentiment Analysis appeared most vulnerable to perturbations, while Question Answering was comparatively less affected.
- Model-Specific Strengths/Weaknesses:
- RoBERTa generally achieved the highest scores on perturbed data, likely due to its robust pretraining.
- ELMo, with its character-based convolutions, showed better resilience to some character-level perturbations than its overall performance would suggest.
- XLNet demonstrated better handling of word order perturbations, potentially due to its permutation language modeling objective.
- Models pretrained on larger corpora (RoBERTa, XLNet) were more robust to synonym replacements.
- User Study: A user study confirmed that most character-level perturbations and some word-level ones (Repetition, SPV, VT) produced text that remained understandable and preserved the original meaning (94% of samples judged understandable and meaning-preserving). In contrast, perturbations such as Deletion, RWS, Negation, and Word Order often changed the meaning or rendered the text meaningless (only 39% preserved meaning and label consistency), so these require manual curation for a fair robustness evaluation.
Practical Implications:
- The results strongly suggest that benchmark accuracy alone is insufficient to gauge the real-world reliability of NLP models.
- Evaluating models on perturbed data, using methods like those presented, should become a standard practice to get a more realistic understanding of robustness.
- The provided perturbation methods and open-source code (available on GitHub) offer a practical toolkit for developers to test their NLP systems against common noise types.
- The findings highlight specific vulnerabilities (e.g., sensitivity to character noise, word order) that need improvement in future model development.
In conclusion, the paper provides a systematic methodology and empirical evidence demonstrating the fragility of modern LMs to realistic input noise, advocating for the integration of robustness testing into standard NLP evaluation pipelines.