This paper provides a comprehensive survey on defining, measuring, and improving robustness in NLP models. It aims to unify various research lines scattered across different communities and tasks.
1. Defining Robustness:
Robustness is broadly defined as a model's ability to maintain performance when tested on data (x′,y′) drawn from a distribution D′ that differs from the training distribution D. The paper categorizes the construction of D′ into two main types:
- Adversarial Attacks (Synthetic Shifts): D′ is created by applying perturbations (e.g., character/word swaps, paraphrasing, adding distractors) to the original input x. Key assumptions include whether the perturbation is label-preserving (y′=y) or label-changing (y′≠y), and whether it is semantics-preserving or semantics-modifying.
- Distribution Shift (Natural Shifts): D′ represents naturally occurring variations, such as differences in domains, dialects, grammar, or time periods. This is closely related to domain generalization and fairness concerns (e.g., gender bias).
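This definition suggests a simple way to quantify robustness as the gap between performance on D and on D′. The snippet below is a minimal sketch of that evaluation setup, not something taken from the paper; `model`, `in_dist_data`, and `shifted_data` are hypothetical placeholders.

```python
# Minimal sketch of measuring a robustness gap: the drop in accuracy between a
# test set drawn from D and a shifted or perturbed test set drawn from D'.
# `model.predict` and the two datasets are hypothetical stand-ins.

def accuracy(model, data):
    """data: list of (x, y) pairs; model.predict(x) returns a predicted label."""
    return sum(model.predict(x) == y for x, y in data) / len(data)

def robustness_gap(model, in_dist_data, shifted_data):
    """In-distribution accuracy minus out-of-distribution accuracy;
    a large positive gap indicates a robustness failure."""
    return accuracy(model, in_dist_data) - accuracy(model, shifted_data)
```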
The paper notes that robustness in text generation tasks (e.g., avoiding hallucination, positional bias) is less formally defined, largely because automatic evaluation metrics for generation are themselves unreliable. It also highlights reliance on spurious correlations (features that correlate with labels in D but not in D′) as a common underlying cause of failures under both synthetic and natural shifts, and draws connections to model instability and poor uncertainty calibration under distribution shift.
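To make the spurious-correlation failure mode concrete, here is a toy illustration (not an example from the paper) in which the token "subtitles" co-occurs only with negative reviews in the training data D, so a bag-of-words model can exploit it, while the correlation breaks in the shifted data D′. The data are fabricated solely for illustration and the setup assumes scikit-learn.

```python
# Toy spurious-correlation demo: "subtitles" predicts the negative label in D
# but not in D'. All data are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great plot", "loved the soundtrack", "wonderful acting",
               "bad pacing with subtitles", "dull story with subtitles",
               "weak ending with subtitles"]
train_labels = [1, 1, 1, 0, 0, 0]      # "subtitles" spuriously marks label 0 in D

shifted_texts = ["great plot with subtitles", "dull story"]
shifted_labels = [1, 0]                # the correlation no longer holds in D'

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
# Accuracy on D' will typically fall well below the near-perfect accuracy on D.
print(model.score(shifted_texts, shifted_labels))
```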
2. Comparing Robustness in Vision and NLP:
While drawing inspiration from computer vision, NLP robustness presents unique challenges:
- Discrete vs. Continuous: Text is discrete, making gradient-based attacks from vision less directly applicable; attacks instead search over discrete substitutions (a minimal sketch follows this list).
- Perceptibility: Vision attacks often aim for imperceptible changes, whereas NLP attacks focus on meaning preservation despite perceptible changes.
- Data Distributions: NLP domain shifts can involve different supports (vocabularies), unlike vision where the pixel space support is usually shared.
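As a rough illustration of how attacks adapt to the discrete input space, the sketch below runs a greedy, black-box word-substitution search instead of a gradient step. `predict_proba` (confidence in the original label) and `synonyms` are hypothetical stand-ins for a real classifier and a real synonym source; actual attacks such as TextFooler add semantic-similarity and grammaticality constraints on top of this loop.

```python
# Minimal sketch of a greedy, black-box word-substitution attack on discrete text.
# `predict_proba` and `synonyms` are stand-ins supplied by the caller.

def greedy_word_swap_attack(text, predict_proba, synonyms, target_confidence=0.4):
    """Greedily swap words for synonyms that most reduce the model's confidence
    in the original label, stopping once it drops below `target_confidence`."""
    words = text.split()
    best_score = predict_proba(" ".join(words))
    for i in range(len(words)):
        for candidate in synonyms.get(words[i].lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            score = predict_proba(" ".join(trial))
            if score < best_score:        # keep the most damaging swap at this position
                words, best_score = trial, score
        if best_score < target_confidence:
            break
    return " ".join(words)
```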
3. Identifying Robustness Failures:
Methods for identifying failures include:
- Human Prior and Error Analyses: Manually analyzing model errors on specific tasks (NLI, QA, MT, parsing, generation) to understand failure modes (e.g., hypothesis-only bias, reliance on keywords, sensitivity to noise) and creating stress tests or challenge datasets (e.g., HANS, AdvSQuAD). This also includes identifying dataset biases and annotation artifacts.
- Model-based Identification: Using task-agnostic methods like automated text attacks (e.g., TextFooler, HotFlip), universal adversarial triggers, or training auxiliary models to explicitly capture dataset biases or spurious shortcuts. Human-in-the-loop and model-in-the-loop approaches (e.g., ANLI, Dynabench) are used to create challenging benchmarks.
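As an example of the bias-identification idea, the sketch below trains a hypothesis-only probe for NLI: if a classifier that never sees the premise beats the majority-class baseline, the dataset likely contains annotation artifacts. It assumes scikit-learn and (premise, hypothesis) pairs supplied by the caller; it is a sketch of the general recipe, not the exact setup used in the cited work.

```python
# Hypothesis-only bias probe for NLI: deliberately discard the premise and see
# how well labels can still be predicted. Data loading is left to the caller.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def hypothesis_only_probe(train_pairs, train_labels, test_pairs, test_labels):
    """Each pair is (premise, hypothesis); only the hypothesis is used."""
    probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    probe.fit([hyp for _, hyp in train_pairs], train_labels)
    return probe.score([hyp for _, hyp in test_pairs], test_labels)
```

Accuracy well above the majority-class rate is a red flag; the same probe can also serve later as the "bias-only" model used by several mitigation strategies below.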
4. Improving Model Robustness:
Mitigation strategies are categorized as:
- Data-driven: Using data augmentation techniques like Mixup, MixText, AugMix, or adding adversarial/counterfactual examples (generated via methods like CAT-Gen or human annotation) to the training data.
- Model and Training-based: Leveraging large pre-trained models (shown to improve OOD robustness), focusing training on minority or hard-to-learn examples (e.g., Group DRO, re-weighting schemes; a worst-group loss sketch follows this list), or fine-tuning strategies.
- Inductive-prior-based: Introducing explicit biases or regularizers to discourage reliance on spurious features, often by ensembling with a "bias-only" model (sketched below) or using techniques inspired by domain adaptation (e.g., DANN, IRM). SMART and InfoBERT are mentioned as regularization techniques for fine-tuning.
- Causal Intervention: Using causal analysis to identify and mitigate confounders or spurious correlations by learning (approximately) counterfactually invariant predictors.
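As a sketch of the Group-DRO-style objective referenced in the training-based bullet above, the snippet below computes a simplified worst-group loss, assuming PyTorch and per-example group ids (e.g., whether a known spurious cue is present). The full Group DRO algorithm additionally maintains adaptive group weights updated during training; this is only the core objective.

```python
# Simplified worst-group objective in the spirit of Group DRO: minimize the
# loss of the worst-performing group rather than the average loss.
# Assumes PyTorch tensors and integer group ids aligned with the batch.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, groups):
    """Average the cross-entropy within each group, then take the maximum."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [per_example[groups == g].mean() for g in torch.unique(groups)]
    return torch.stack(group_losses).max()
```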
The paper notes many mitigation methods connect conceptually, often aiming to counter known spurious patterns via augmentation, re-weighting, ensembling, or regularization.
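As one concrete instance of the ensembling idea (the "bias-only" ensembling from the inductive-prior bullet), the snippet below sketches product-of-experts style debiasing, assuming PyTorch, a frozen bias-only model (e.g., the hypothesis-only probe above), and a main model being fine-tuned. The bias model is detached, so gradients flow only through the main model, which is therefore pushed to explain what the bias model cannot.

```python
# Product-of-experts style debiasing: combine main-model and bias-only
# log-probabilities, then train the main model on the combined distribution.
import torch.nn.functional as F

def poe_loss(main_logits, bias_logits, labels):
    # Detach the bias expert so only the main model receives gradients.
    combined = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits, dim=-1).detach()
    # cross_entropy re-normalizes `combined`, yielding the product-of-experts NLL.
    return F.cross_entropy(combined, labels)
```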
5. Open Questions and Future Directions:
The survey concludes by highlighting key challenges:
- Automatically discovering unknown robustness failures without relying solely on human priors.
- Developing better methods for interpreting why models fail and mitigating spurious correlations, potentially without sacrificing in-distribution performance.
- Creating unified, easy-to-use evaluation frameworks (e.g., CheckList, Robustness Gym, Dynabench); a toy invariance check is sketched after this list.
- Incorporating user-centric perspectives and human cognitive priors into robustness measures and mitigation.
- Understanding the connection between human linguistic generalization and model generalization.
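In the spirit of those evaluation frameworks, here is the toy invariance check referenced above: a minimal CheckList-style test, written from scratch rather than with the actual CheckList library, asserting that predictions should survive small, label-preserving typo perturbations. `model.predict` is a hypothetical stand-in for any text classifier.

```python
# Toy CheckList-style invariance test (not the actual CheckList library):
# a prediction should not flip under small, label-preserving typos.
import random

def add_typo(text, rng):
    """Swap two adjacent characters inside one randomly chosen longer word."""
    words = text.split()
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def invariance_failure_rate(model, texts, n_perturbations=5, seed=0):
    """Fraction of inputs whose prediction changes under at least one typo."""
    rng = random.Random(seed)
    failures = 0
    for text in texts:
        original = model.predict(text)
        if any(model.predict(add_typo(text, rng)) != original
               for _ in range(n_perturbations)):
            failures += 1
    return failures / len(texts)
```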
Overall, the paper advocates for a more unified and systematic approach to understanding and addressing robustness issues in NLP, emphasizing the need for comprehensive benchmarks, exploration of transferability between different robustness types, and development of more effective mitigation strategies.