Pretrained Transformers Improve Out-of-Distribution Robustness (2004.06100v2)

Published 13 Apr 2020 in cs.CL and cs.LG

Abstract: Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers' performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.

Authors (6)
  1. Dan Hendrycks (63 papers)
  2. Xiaoyuan Liu (44 papers)
  3. Eric Wallace (42 papers)
  4. Adam Dziedzic (47 papers)
  5. Rishabh Krishnan (1 paper)
  6. Dawn Song (229 papers)
Citations (400)

Summary

Enhancing Out-of-Distribution Robustness with Pretrained Transformers

The paper by Hendrycks et al. examines the robustness of pretrained Transformer models in NLP, particularly their ability to handle out-of-distribution (OOD) data. The work is significant because it provides empirical evidence on how these models perform under realistic distribution shifts, a setting where traditional NLP models have struggled.

Methodology and Experimental Setup

The research rigorously evaluates the OOD robustness of various NLP models, including pretrained Transformers like BERT and RoBERTa, against more conventional models such as bag-of-words (BoW), LSTMs, and ConvNets with word embeddings. The paper is structured around seven NLP datasets encompassing tasks like sentiment analysis, semantic similarity, question answering, and textual entailment. OOD robustness is measured in terms of both generalization to new data distributions and the ability to detect OOD examples.
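
The generalization half of this measurement reduces to comparing accuracy on the in-distribution test set with accuracy on the shifted one. Below is a minimal sketch of that comparison, not the authors' code; the toy classifier and examples are hypothetical placeholders:

```python
from typing import Callable, Dict, List

def accuracy(predict: Callable[[str], int], examples: List[Dict]) -> float:
    """Fraction of examples the classifier labels correctly."""
    correct = sum(predict(ex["text"]) == ex["label"] for ex in examples)
    return correct / len(examples)

def ood_gap(predict: Callable[[str], int],
            id_test: List[Dict], ood_test: List[Dict]) -> float:
    """Accuracy lost under distribution shift (positive = degradation)."""
    return accuracy(predict, id_test) - accuracy(predict, ood_test)

# Toy stand-in for a fine-tuned sentiment model (hypothetical).
predict = lambda text: int("great" in text.lower())

id_test = [{"text": "Great phone", "label": 1},
           {"text": "Bad sound", "label": 0}]
ood_test = [{"text": "A superb novel", "label": 1},
            {"text": "Dull plot", "label": 0}]

print(ood_gap(predict, id_test, ood_test))  # 0.5 on this toy data
```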

To achieve this, the authors construct a benchmark that induces distribution shifts either by splitting datasets on metadata or by pairing similar yet distinct datasets (for example, training on SST-2 movie reviews and testing on IMDb reviews). The OOD test sets are designed to reflect real-world shifts in writing style, topic, and vocabulary, which clarifies how well models generalize beyond their original training data. A sketch of the metadata-splitting idea follows.
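
A hedged illustration of the first construction: one dataset is partitioned on a metadata field so that the training and OOD test pools come from different subpopulations. The field name and review examples here are hypothetical, not the paper's actual data:

```python
from typing import Dict, List, Set, Tuple

def metadata_split(examples: List[Dict], field: str,
                   in_dist_values: Set[str]) -> Tuple[List[Dict], List[Dict]]:
    """Partition examples into in-distribution and OOD pools by a metadata field."""
    in_dist = [ex for ex in examples if ex[field] in in_dist_values]
    ood = [ex for ex in examples if ex[field] not in in_dist_values]
    return in_dist, ood

# Hypothetical usage: train on electronics reviews, test OOD on book reviews.
reviews = [
    {"text": "Great battery life", "label": 1, "category": "electronics"},
    {"text": "A moving story", "label": 1, "category": "books"},
    {"text": "Stopped working after a week", "label": 0, "category": "electronics"},
]
in_dist_pool, ood_test = metadata_split(reviews, "category", {"electronics"})
```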

Key Findings

  1. Robustness of Pretrained Transformers: The paper demonstrates that pretrained Transformers significantly outperform traditional models in generalizing to OOD data. For instance, while LSTM models experience performance declines of over 35% on OOD data, RoBERTa maintains or even improves performance in some scenarios. This suggests that the architecture and pretraining process of Transformers confer a robustness advantage.
  2. Model Size and Diversity: Interestingly, robustness does not necessarily improve with larger model size, contrary to trends observed in computer vision. Pretraining on more diverse data, however, does enhance OOD generalization: RoBERTa's larger and more varied pretraining corpus likely contributes to its superior robustness compared to BERT.
  3. OOD Detection Capabilities: When the maximum softmax probability is used as an anomaly score, pretrained Transformers detect OOD examples far more reliably than non-pretrained models, achieving considerably lower false-alarm rates; many earlier models perform worse than chance at this task. A sketch of this detector appears after this list.
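
The detector named above is the standard maximum-softmax-probability (MSP) baseline: score each input by its highest softmax probability and flag low-scoring inputs as OOD. The sketch below is an assumed implementation of that idea with a common summary metric (false-alarm rate at 95% in-distribution recall); the simulated logits are synthetic placeholders, not results from the paper:

```python
import numpy as np

def msp_scores(logits: np.ndarray) -> np.ndarray:
    """Score each input by its maximum softmax probability
    (higher = more in-distribution-like)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def fpr_at_95_tpr(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """False-alarm rate: fraction of OOD inputs accepted as in-distribution
    at the threshold that retains 95% of in-distribution inputs."""
    threshold = np.percentile(id_scores, 5)  # 95% of ID scores lie above this
    return float((ood_scores >= threshold).mean())

# Hypothetical logits from a binary classifier; ID inputs are more confident.
rng = np.random.default_rng(0)
id_logits = rng.normal([3.0, 0.0], 1.0, size=(1000, 2))
ood_logits = rng.normal([1.0, 0.8], 1.0, size=(1000, 2))
print(fpr_at_95_tpr(msp_scores(id_logits), msp_scores(ood_logits)))
```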

Implications and Future Directions

The enhanced robustness of pretrained Transformers has both practical and theoretical implications for NLP. Practically, these models offer more reliable performance and broader applicability in environments where data distributions are dynamic and evolving. Theoretically, these findings stimulate further inquiry into the mechanisms underlying the robustness of Transformers. Investigating aspects such as diverse pretraining data and self-supervised learning objectives could lead to innovations in model design and robustness techniques.

Moreover, the research motivates refinements to model compression and distillation strategies, since the authors find that distillation can degrade OOD robustness. Future work could design self-supervised objectives that inherently enhance robustness, or augment existing models with techniques that improve OOD detection reliability, drawing where useful on advances in computer vision and other AI domains.

In conclusion, this research establishes a comprehensive benchmark for evaluating OOD robustness and positions pretrained Transformers as a formidable choice in developing NLP systems resilient to distributional shifts. Continued exploration in this vein could significantly refine our understanding and capabilities in handling OOD data across various AI fields.