- The paper introduces R-Drop, a method that minimizes dropout-induced inconsistencies by aligning sub-model outputs using bidirectional KL divergence.
- The paper demonstrates significant performance gains across five tasks, achieving state-of-the-art BLEU scores in neural machine translation and improvements on language and vision benchmarks.
- The paper addresses practical concerns, notably the increased training cost of the double forward pass, and discusses adjustments such as dropout rates and batch sizes to offset it.
R-Drop: Regularized Dropout for Neural Networks
Dropout is a well-established technique for regularizing deep neural networks, providing an implicit ensemble method by randomly omitting hidden units during training. Despite its efficacy, this stochasticity creates an inconsistency between training, where random sub-models are sampled, and inference, where the full model runs without dropout. The paper "R-Drop: Regularized Dropout for Neural Networks" introduces R-Drop, a simple approach that reduces this inconsistency by constraining the outputs of the sub-models that dropout produces.
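To make the inconsistency concrete, here is a minimal PyTorch illustration (not from the paper): the same input passed twice through a dropout-containing network gives different outputs in training mode, while inference in eval mode is deterministic and uses no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.3), nn.Linear(8, 4))
x = torch.randn(2, 8)

net.train()
out_a = net(x)  # one randomly sampled sub-model (some units dropped)
out_b = net(x)  # a different random sub-model
print(torch.allclose(out_a, out_b))  # False: the two training outputs disagree

net.eval()
print(torch.allclose(net(x), net(x)))  # True: inference is deterministic, no dropout
```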
Methodology
R-Drop implements a training strategy that augments conventional dropout with a consistency constraint using the bidirectional Kullback-Leibler (KL) divergence to ensure sub-models' outputs align for each training sample. This method entails performing two forward passes for each sample within a mini-batch, with dropout applied independently to these passes, resulting in two different sub-model outputs. R-Drop then minimizes the divergence between these outputs, promoting consistency without altering model architecture.
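As a rough illustration, the sketch below assembles the R-Drop objective in PyTorch for a classification model: two forward passes over the same batch, cross-entropy on both outputs, and a symmetric KL term weighted by a hyperparameter (here called `alpha`, an assumed name). It is a minimal sketch under these assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """One R-Drop training step for a classifier (model must be in train mode
    so dropout samples a different mask on each forward pass)."""
    logits1 = model(x)  # first forward pass, one dropout mask
    logits2 = model(x)  # second forward pass, another dropout mask

    # Standard task loss on both sub-model outputs.
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional (symmetric) KL divergence between the two distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(logp1, logp2, log_target=True, reduction="batchmean") \
       + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")

    return ce + alpha * 0.5 * kl
```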
The analysis in the paper shows that R-Drop reduces the inconsistency that typically arises between training with dropout and inference without it, providing theoretical grounding for its ability to mitigate one of dropout's main limitations.
Experimental Results
The experimental results give strong evidence of R-Drop's effectiveness. The technique was evaluated across five core tasks spanning 18 datasets: neural machine translation (NMT), abstractive summarization, language understanding, language modeling, and image classification. In NMT, R-Drop achieved substantial performance improvements. For instance, it reached state-of-the-art results on the WMT14 English-to-German and English-to-French translation tasks, recording BLEU scores of 30.91 and 43.95, respectively, surpassing models trained with much larger datasets or more sophisticated Transformer variants.
Language understanding benchmarks showed similar trends. Fine-tuning BERT and RoBERTa with R-Drop yielded notable improvements across the GLUE benchmark, with an average gain of 1.21 points on the BERT-base model.
In abstractive summarization, applying R-Drop to BART on the CNN/DailyMail dataset raised ROUGE scores above strong contemporaries such as PEGASUS. Image classification, tested on CIFAR-100 and ImageNet, also benefited, with accuracy gains when fine-tuning pre-trained Vision Transformer (ViT) models.
Implications and Future Directions
R-Drop is a straightforward enhancement applicable to a wide range of models, including the fine-tuning stages of large pre-trained systems. Its success suggests broader applicability to any scenario where dropout's training-time randomness creates a mismatch with deterministic inference. Future work could explore applying R-Drop during pre-training, which may yield further improvements.
The main practical constraint is the higher training cost of the per-sample double forward pass; the authors discuss adjustments such as dropout rates and batch sizes that help offset this cost without sacrificing the gains.
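One practical way to realize the double forward pass, shown here as a hypothetical sketch rather than the paper's recipe, is to duplicate each sample within the batch so both dropout-perturbed outputs come from a single pass over a batch of doubled size; the per-step cost is then roughly that of training with twice the batch size.

```python
import torch

def doubled_forward(model, x):
    # Repeat the batch along dim 0: each sample appears twice, and dropout
    # applies an independent mask to each row, yielding two sub-model outputs
    # per sample from a single forward pass.
    x2 = torch.cat([x, x], dim=0)  # shape [2B, ...]
    logits = model(x2)
    logits1, logits2 = logits.chunk(2, dim=0)
    return logits1, logits2
```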
Conclusion
R-Drop is an effective upgrade to conventional dropout, improving performance by addressing the dropout-induced inconsistency between training and inference. Its consistent gains across diverse tasks underscore its generality, making it a valuable tool for improving model robustness and generalization.