
Evaluating the Robustness of Neural Language Models to Input Perturbations (2108.12237v1)

Published 27 Aug 2021 in cs.CL and cs.AI

Abstract: High-performance neural language models have obtained state-of-the-art results on a wide range of NLP tasks. However, results for common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data. In this study, we design and implement various types of character-level and word-level perturbation methods to simulate realistic scenarios in which input texts may be slightly noisy or different from the data distribution on which NLP systems were trained. Conducting comprehensive experiments on different NLP tasks, we investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations. The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced. We highlight that models need to be further improved and that current benchmarks are not reflecting model robustness well. We argue that evaluations on perturbed inputs should routinely complement widely-used benchmarks in order to yield a more realistic understanding of NLP systems' robustness.

This paper investigates the robustness of state-of-the-art neural language models (LMs) like BERT, RoBERTa, XLNet, and ELMo to various input perturbations that simulate real-world noise (Moradi et al., 2021). The core argument is that standard benchmark evaluations often fail to capture how these models perform when encountering slightly noisy or altered text, which is common in practical applications.

To evaluate robustness, the researchers designed and implemented a comprehensive set of character-level and word-level perturbation methods. These methods aim to mimic realistic noise scenarios rather than worst-case adversarial attacks.

Character-Level Perturbations:

  • Insertion: Randomly insert a character into a word.
  • Deletion: Randomly delete a character from a word (not first or last).
  • Replacement: Replace a character with an adjacent one on the keyboard.
  • Swapping: Swap adjacent characters within a word.
  • Repetition: Repeat a character within a word.
  • Common Misspelled Words (CMW): Replace words with common misspellings from a Wikipedia list.
  • Letter Case Changing (LCC): Randomly change the case of the first character or all characters in a word.
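
To make these operations concrete, the sketch below implements a few of them in Python. The function names and the partial QWERTY-adjacency map are illustrative assumptions, not the authors' released code.

```python
import random
import string

# Illustrative (partial) keyboard-adjacency map used for the Replacement
# perturbation; a full version would cover the whole QWERTY layout.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "i": "ujko", "n": "bhjm", "o": "iklp", "t": "rfgy",
}

def insert_char(word: str) -> str:
    """Insertion: add a random lowercase letter at a random position."""
    pos = random.randint(0, len(word))
    return word[:pos] + random.choice(string.ascii_lowercase) + word[pos:]

def delete_char(word: str) -> str:
    """Deletion: drop an inner character (never the first or last one)."""
    if len(word) <= 2:
        return word
    pos = random.randint(1, len(word) - 2)
    return word[:pos] + word[pos + 1:]

def replace_char(word: str) -> str:
    """Replacement: substitute a character with a keyboard-adjacent one."""
    candidates = [i for i, c in enumerate(word) if c.lower() in QWERTY_NEIGHBORS]
    if not candidates:
        return word
    pos = random.choice(candidates)
    return word[:pos] + random.choice(QWERTY_NEIGHBORS[word[pos].lower()]) + word[pos + 1:]

def swap_chars(word: str) -> str:
    """Swapping: transpose two adjacent characters."""
    if len(word) < 2:
        return word
    pos = random.randint(0, len(word) - 2)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
```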

Word-Level Perturbations:

  • Deletion: Randomly remove a word from the text.
  • Repetition: Randomly repeat a word.
  • Replacement With Synonyms (RWS): Replace words with synonyms from WordNet.
  • Negation: Add or remove negation from verbs.
  • Singular/Plural Verbs (SPV): Swap singular/plural verb forms.
  • Verb Tense (VT): Change verb tense (e.g., present to past).
  • Word Order (WO): Randomly reorder a sequence of M consecutive words.
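
A comparable sketch for some of the word-level operations follows; it assumes NLTK's WordNet interface for the synonym lookup and is, again, illustrative rather than the authors' implementation.

```python
import random
from nltk.corpus import wordnet  # assumes nltk.download("wordnet") has been run

def delete_word(tokens: list[str]) -> list[str]:
    """Deletion: remove one randomly chosen word."""
    if len(tokens) <= 1:
        return tokens
    pos = random.randrange(len(tokens))
    return tokens[:pos] + tokens[pos + 1:]

def repeat_word(tokens: list[str]) -> list[str]:
    """Repetition: duplicate one randomly chosen word in place."""
    pos = random.randrange(len(tokens))
    return tokens[:pos + 1] + tokens[pos:]

def replace_with_synonym(tokens: list[str]) -> list[str]:
    """RWS: replace one word with a WordNet synonym, if any exists."""
    for pos in random.sample(range(len(tokens)), len(tokens)):
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(tokens[pos])
            for lemma in syn.lemmas()
            if lemma.name().lower() != tokens[pos].lower()
        }
        if synonyms:
            out = tokens.copy()
            out[pos] = random.choice(sorted(synonyms))
            return out
    return tokens

def shuffle_word_order(tokens: list[str], m: int = 3) -> list[str]:
    """WO: randomly reorder a window of M consecutive words."""
    if len(tokens) <= m:
        return tokens
    start = random.randrange(len(tokens) - m + 1)
    window = tokens[start:start + m]
    random.shuffle(window)
    return tokens[:start] + window + tokens[start + m:]
```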

The paper applied these perturbations to the test sets of five diverse NLP tasks:

  1. Text Classification (TC): TREC dataset
  2. Sentiment Analysis (SA): Stanford Sentiment Treebank (SST)
  3. Named Entity Recognition (NER): CoNLL-2003 dataset
  4. Semantic Similarity (SS): STS benchmark
  5. Question Answering (QA): WikiQA dataset

The four language models (BERT-LARGE, RoBERTa-LARGE, XLNet-LARGE, ELMo) were fine-tuned on the original training data for each task and then evaluated on both the original test sets and the perturbed versions. The number of perturbations applied per sample (PPS) was varied from 1 to 4 to study the effect of noise intensity.
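
This protocol can be pictured with the hedged sketch below: score a model on the clean test set and on copies with 1 to 4 perturbations per sample, and read the gap as a robustness measure. Here `model_predict` and `metric` are placeholders for a fine-tuned model and a task-specific metric, and none of the names are taken from the paper's code.

```python
import random

def on_random_word(char_perturb):
    """Lift a character-level perturbation (word -> word) to a token-list perturbation."""
    def apply(tokens):
        if not tokens:
            return tokens
        pos = random.randrange(len(tokens))
        return tokens[:pos] + [char_perturb(tokens[pos])] + tokens[pos + 1:]
    return apply

def apply_pps(tokens, perturbations, pps):
    """Apply `pps` randomly chosen perturbations to one sample (PPS = perturbations per sample)."""
    for _ in range(pps):
        tokens = random.choice(perturbations)(tokens)
    return tokens

def robustness_report(test_set, perturbations, model_predict, metric, max_pps=4):
    """Score on the clean test set and on versions with PPS = 1..max_pps."""
    texts, labels = zip(*test_set)  # test_set: iterable of (token_list, label) pairs
    report = {0: metric(labels, [model_predict(t) for t in texts])}
    for pps in range(1, max_pps + 1):
        noisy = [apply_pps(list(t), perturbations, pps) for t in texts]
        report[pps] = metric(labels, [model_predict(t) for t in noisy])
    return report  # e.g. {0: 0.93, 1: 0.88, ...}; the drop quantifies the lack of robustness
```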

Key Findings:

  • Significant Performance Drop: All models showed a notable decrease in performance even with a single perturbation per sample (PPS=1). Performance degradation increased with higher PPS values.
  • Sensitivity to Perturbation Type: Models were generally more sensitive to character-level perturbations than word-level ones across most tasks.
  • Task-Dependent Robustness: Sentiment Analysis appeared most vulnerable to perturbations, while Question Answering was comparatively less affected.
  • Model-Specific Strengths/Weaknesses:
    • RoBERTa generally achieved the highest scores on perturbed data, likely due to its robust pretraining.
    • ELMo, with its character-based convolutions, showed better resilience to some character-level perturbations than its overall performance would suggest.
    • XLNet demonstrated better handling of word order perturbations, potentially due to its permutation language modeling objective.
    • Models pretrained on larger corpora (RoBERTa, XLNet) were more robust to synonym replacements.
  • User Study: A user study confirmed that most character-level perturbations and some word-level ones (Repetition, SPV, VT) produced understandable text that preserved the original meaning (94% judged understandable and meaning-preserving). However, perturbations like Deletion, RWS, Negation, and Word Order often changed the meaning or rendered the text meaningless (only 39% preserved meaning and label consistency), so these require manual curation for a fair robustness evaluation.

Practical Implications:

  • The results strongly suggest that benchmark accuracy alone is insufficient to gauge the real-world reliability of NLP models.
  • Evaluating models on perturbed data, using methods like those presented, should become a standard practice to get a more realistic understanding of robustness.
  • The provided perturbation methods and open-source code (available on GitHub) offer a practical toolkit for developers to test their NLP systems against common noise types.
  • The findings highlight specific vulnerabilities (e.g., sensitivity to character noise, word order) that need improvement in future model development.

In conclusion, the paper provides a systematic methodology and empirical evidence demonstrating the fragility of modern LMs to realistic input noise, advocating for the integration of robustness testing into standard NLP evaluation pipelines.

Authors (2)
  1. Milad Moradi
  2. Matthias Samwald
Citations (84)