Measure and Improve Robustness in NLP Models: A Survey (2112.08313v2)

Published 15 Dec 2021 in cs.CL and cs.LG

Abstract: As NLP models achieved state-of-the-art performances over benchmarks and gained wide applications, it has been increasingly important to ensure the safe deployment of these models in the real world, e.g., making sure the models are robust against unseen or challenging scenarios. Despite robustness being an increasingly studied topic, it has been separately explored in applications like vision and NLP, with various definitions, evaluation and mitigation strategies in multiple lines of research. In this paper, we aim to provide a unifying survey of how to define, measure and improve robustness in NLP. We first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness. Correspondingly, we present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models. Finally, we conclude by outlining open challenges and future directions to motivate further research in this area.

Authors (3)
  1. Xuezhi Wang (64 papers)
  2. Haohan Wang (96 papers)
  3. Diyi Yang (151 papers)
Citations (120)

Summary

This paper provides a comprehensive survey on defining, measuring, and improving robustness in NLP models. It aims to unify various research lines scattered across different communities and tasks.

1. Defining Robustness:

Robustness is broadly defined as a model's ability to maintain performance when tested on data $(x', y')$ drawn from a distribution $\mathcal{D}'$ that differs from the training distribution $\mathcal{D}$. The paper categorizes the construction of $\mathcal{D}'$ into two main types:

  • Adversarial Attacks (Synthetic Shifts): $\mathcal{D}'$ is created by applying perturbations (e.g., character/word swaps, paraphrasing, adding distractors) to the original input $x$. Key assumptions include whether the perturbation is label-preserving ($y' = y$) or label-changing ($y' \neq y$), and whether it is semantic-preserving or semantic-modifying.
  • Distribution Shift (Natural Shifts): $\mathcal{D}'$ represents naturally occurring variations, such as differences in domains, dialects, grammar, or time periods. This is closely related to domain generalization and fairness concerns (e.g., gender bias).
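
Concretely, the definition above can be summarized as a gap in expected loss between the two distributions. The display below is a sketch in standard notation; the model $f$ and loss $\ell$ are introduced here for illustration and are not symbols taken from the paper:

$$
\mathrm{gap}(f) \;=\; \mathbb{E}_{(x', y') \sim \mathcal{D}'}\big[\ell(f(x'), y')\big] \;-\; \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f(x), y)\big],
$$

where a model is considered robust to a given family of shifts when this gap stays small for every $\mathcal{D}'$ in that family, whether $\mathcal{D}'$ arises from synthetic perturbations or from natural variation.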

The paper notes that robustness in text generation tasks (e.g., avoiding hallucination, positional bias) is less formally defined due to challenges in robust evaluation metrics. It also highlights the common underlying issue of models relying on spurious correlations (features correlating with labels in $\mathcal{D}$ but not $\mathcal{D}'$) as a key reason for lack of robustness in both synthetic and natural shifts. Connections are also drawn to model instability and poor uncertainty calibration under distribution shifts.
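
As a toy illustration of the spurious-correlation point (entirely synthetic; the feature names, probabilities, and use of scikit-learn are assumptions of this sketch, not an experiment from the paper), a classifier that latches onto a shortcut feature can look strong on $\mathcal{D}$ and degrade sharply on $\mathcal{D}'$:

```python
# Toy, fully synthetic illustration of a spurious correlation (not from the paper).
# A "causal" feature is moderately predictive under both D and D'; a "spurious"
# feature agrees with the label 95% of the time in D but only 50% in D'.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, spurious_agreement):
    y = rng.integers(0, 2, size=n)
    causal = np.where(rng.random(n) < 0.75, y, 1 - y)
    spurious = np.where(rng.random(n) < spurious_agreement, y, 1 - y)
    return np.stack([causal, spurious], axis=1), y

X_train, y_train = sample(5000, spurious_agreement=0.95)  # D
X_shift, y_shift = sample(5000, spurious_agreement=0.50)  # D': the shortcut breaks

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy on D :", clf.score(X_train, y_train))   # high, driven largely by the shortcut
print("accuracy on D':", clf.score(X_shift, y_shift))   # drops once the shortcut fails
```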

2. Comparing Robustness in Vision and NLP:

While drawing inspiration from computer vision, NLP robustness presents unique challenges:

  • Discrete vs. Continuous: Text is discrete, making gradient-based attacks from vision less directly applicable (a common embedding-space workaround is sketched after this list).
  • Perceptibility: Vision attacks often aim for imperceptible changes, whereas NLP attacks focus on meaning preservation despite perceptible changes.
  • Data Distributions: NLP domain shifts can involve different supports (vocabularies), unlike vision where the pixel space support is usually shared.
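
To illustrate the discrete-vs-continuous gap, the sketch below applies a single FGSM-style step in embedding space, a common continuous workaround; the model interface (a forward pass over pre-computed token embeddings) is an assumption of this sketch, not an API from the paper or from any particular library:

```python
# Sketch: gradient-based perturbations cannot be applied to discrete token ids,
# but they can be applied to the continuous embedding layer as an approximation.
# `model` is assumed to map token embeddings [batch, seq_len, dim] to logits.
import torch
import torch.nn.functional as F

def embedding_space_fgsm(model, embeddings, labels, epsilon=1e-2):
    embeddings = embeddings.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeddings), labels)
    loss.backward()
    # One signed-gradient step in embedding space. The perturbed vectors no
    # longer correspond to real tokens -- which is exactly the vision/NLP gap:
    # in vision, the perturbed pixels still form a valid image.
    return (embeddings + epsilon * embeddings.grad.sign()).detach()
```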

3. Identifying Robustness Failures:

Methods for identifying failures include:

  • Human Prior and Error Analyses: Manually analyzing model errors on specific tasks (NLI, QA, MT, parsing, generation) to understand failure modes (e.g., hypothesis-only bias, reliance on keywords, sensitivity to noise) and creating stress tests or challenge datasets (e.g., HANS, AdvSQuAD). This also includes identifying dataset biases and annotation artifacts.
  • Model-based Identification: Using task-agnostic methods like automated text attacks (e.g., TextFooler, HotFlip), universal adversarial triggers, or training auxiliary models to explicitly capture dataset biases or spurious shortcuts. Human-in-the-loop and model-in-the-loop approaches (e.g., ANLI, Dynabench) are used to create challenging benchmarks.
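
As a minimal sketch of the automated attacks mentioned in the last bullet (in the spirit of TextFooler, but omitting its semantic-similarity and part-of-speech constraints), the loop below greedily replaces the most influential words; `predict_proba` and `get_synonyms` are hypothetical stand-ins for a victim classifier and a synonym source, not real library APIs:

```python
# Greedy, score-guided word substitution, loosely in the spirit of TextFooler.
# `predict_proba(text) -> dict[label, prob]` and `get_synonyms(word) -> list[str]`
# are hypothetical callables supplied by the caller.

def greedy_word_attack(text, true_label, predict_proba, get_synonyms):
    words = text.split()
    base = predict_proba(text)[true_label]

    # 1. Rank words by how much deleting each one lowers confidence in the true label.
    importance = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        importance.append((base - predict_proba(reduced)[true_label], i))

    # 2. Greedily substitute the most important words with the synonym that
    #    lowers the true-label probability the most, stopping once the label flips.
    for _, i in sorted(importance, reverse=True):
        best_word = words[i]
        best_prob = predict_proba(" ".join(words))[true_label]
        for candidate in get_synonyms(words[i]):
            trial_prob = predict_proba(" ".join(words[:i] + [candidate] + words[i + 1:]))[true_label]
            if trial_prob < best_prob:
                best_word, best_prob = candidate, trial_prob
        words[i] = best_word
        probs = predict_proba(" ".join(words))
        if max(probs, key=probs.get) != true_label:
            return " ".join(words)  # adversarial example found
    return None  # no label flip within this search space
```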

4. Improving Model Robustness:

Mitigation strategies are categorized as:

  • Data-driven: Using data augmentation techniques like Mixup, MixText, AugMix, or adding adversarial/counterfactual examples (generated via methods like CAT-Gen or human annotation) to the training data.
  • Model and Training-based: Leveraging large pre-trained models (shown to improve OOD robustness), focusing training on minority or hard-to-learn examples (e.g., Group DRO, re-weighting schemes), or fine-tuning strategies.
  • Inductive-prior-based: Introducing explicit biases or regularizers to discourage reliance on spurious features, often by ensembling with a "bias-only" model or using techniques inspired by domain adaptation (e.g., DANN, IRM). SMART and InfoBERT are mentioned as regularization techniques for fine-tuning.
  • Causal Intervention: Using causal analysis to identify and mitigate confounders or spurious correlations by learning (approximately) counterfactually invariant predictors.

The paper notes that many mitigation methods are conceptually connected, often aiming to counter known spurious patterns via augmentation, re-weighting, ensembling, or regularization.
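
For instance, the bias-only ensembling idea above is often realized as a product-of-experts loss at training time. The sketch below assumes that main-model and bias-model logits are already available (in practice the bias model might be a hypothesis-only or bag-of-words classifier); it is an illustrative instance of the general recipe rather than the exact formulation of any one cited method:

```python
# Product-of-experts style debiasing: combine the main model's logits with a
# frozen bias-only model's logits during training, so the main model receives
# little gradient signal on examples the bias model already solves.
# At test time the main model is used alone.
import torch
import torch.nn.functional as F

def product_of_experts_loss(main_logits, bias_logits, labels):
    # Adding log-probabilities multiplies the two distributions (a "product of experts");
    # cross_entropy renormalizes the combined scores before computing the loss.
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits.detach(), dim=-1)
    return F.cross_entropy(combined, labels)
```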

5. Open Questions and Future Directions:

The survey concludes by highlighting key challenges:

  • Automatically discovering unknown robustness failures without relying solely on human priors.
  • Developing better methods for interpreting why models fail and mitigating spurious correlations, potentially without sacrificing in-distribution performance.
  • Creating unified, easy-to-use evaluation frameworks (e.g., CheckList, Robustness Gym, Dynabench).
  • Incorporating user-centric perspectives and human cognitive priors into robustness measures and mitigation.
  • Understanding the connection between human linguistic generalization and model generalization.

Overall, the paper advocates for a more unified and systematic approach to understanding and addressing robustness issues in NLP, emphasizing the need for comprehensive benchmarks, exploration of transferability between different robustness types, and development of more effective mitigation strategies.