
R-Drop: Regularized Dropout for Neural Networks (2106.14448v2)

Published 28 Jun 2021 in cs.LG

Abstract: Dropout is a powerful and widely used technique to regularize the training of deep neural networks. In this paper, we introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performances with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at https://github.com/dropreg/R-Drop.

Citations (401)

Summary

  • The paper introduces R-Drop, a method that minimizes dropout-induced inconsistencies by aligning sub-model outputs using bidirectional KL divergence.
  • The paper demonstrates significant performance gains across five tasks, achieving state-of-the-art BLEU scores in neural machine translation and improvements on language and vision benchmarks.
  • The paper addresses practical challenges by discussing increased training costs and suggesting adjustments like modifying dropout rates to enhance model robustness.

R-Drop: Regularized Dropout for Neural Networks

Dropout is a well-established technique for regularizing deep neural networks, providing an implicit ensemble method by randomly omitting hidden units during training. Despite its efficacy, dropout generates a discrepancy between training and inference phases due to this stochasticity. The paper "R-Drop: Regularized Dropout for Neural Networks" introduces R-Drop, a novel approach aimed at minimizing this inconsistency by reducing the variance in sub-model outputs generated by dropout.

Methodology

R-Drop implements a training strategy that augments conventional dropout with a consistency constraint using the bidirectional Kullback-Leibler (KL) divergence to ensure sub-models' outputs align for each training sample. This method entails performing two forward passes for each sample within a mini-batch, with dropout applied independently to these passes, resulting in two different sub-model outputs. R-Drop then minimizes the divergence between these outputs, promoting consistency without altering model architecture.
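
A minimal PyTorch-style sketch of this objective is shown below. It assumes a classification-style model in training mode (so dropout is active); the function name, the default weighting coefficient `alpha`, and other details are illustrative assumptions rather than the official implementation.

```python
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha=1.0):
    """Cross-entropy plus bidirectional KL between two dropout sub-models.

    `alpha` weights the consistency term; the paper tunes it per task,
    so this default is only an illustrative choice.
    """
    # Two independent forward passes: dropout draws a different mask each time,
    # so logits1 and logits2 come from two different sub-models.
    logits1 = model(inputs)
    logits2 = model(inputs)

    # Standard supervised loss, averaged over the two passes.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Bidirectional (symmetric) KL divergence between the two output distributions.
    lp1 = F.log_softmax(logits1, dim=-1)
    lp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(lp1, lp2, log_target=True, reduction="batchmean")
                + F.kl_div(lp2, lp1, log_target=True, reduction="batchmean"))

    return ce + alpha * kl
```

At inference time nothing changes: dropout is disabled as usual and the model is run once.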

The paper's analysis shows that R-Drop constrains the freedom of the model parameters and complements standard dropout, reducing the inconsistency that otherwise arises between training with dropout and inference without it. This theoretical grounding supports the claim that R-Drop mitigates one of dropout's primary limitations.

Experimental Results

The experimental results underline R-Drop's effectiveness. The technique was evaluated across five core tasks spanning 18 datasets: neural machine translation (NMT), abstractive summarization, language understanding, language modeling, and image classification. In NMT, R-Drop achieved substantial improvements, reaching state-of-the-art results on the WMT14 English-to-German and English-to-French translation tasks with BLEU scores of 30.91 and 43.95, respectively, and surpassing models trained on extra large-scale data or with expert-designed Transformer variants.

Language understanding benchmarks showed similar trends: fine-tuning BERT and RoBERTa with R-Drop yielded notable improvements across the GLUE benchmark, including an average gain of 1.21 points for the BERT-base model.

In abstractive summarization with BART on the CNN/DailyMail dataset, R-Drop lifted performance beyond strong baselines such as PEGASUS. Image classification on CIFAR-100 and ImageNet also benefited, with accuracy gains when fine-tuning pre-trained Vision Transformer (ViT) models.

Implications and Future Directions

R-Drop is a straightforward enhancement applicable to a wide range of models, and it is especially useful when fine-tuning large pre-trained systems. Its success suggests broader applicability in any setting where the train-inference gap introduced by dropout might hurt performance. Future work could explore applying R-Drop during pre-training, which may yield further improvements by regularizing the model from earlier stages of training.

A notable limitation is the higher training cost of the two forward passes per sample, but the authors discuss practical adjustments, such as tuning the dropout rate or batch size, that offset this cost without sacrificing the improvements.
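
As a sketch of one such adjustment (an assumption about implementation style, not a description of the official code), the two passes can be folded into a single forward call by duplicating each sample within the batch; dropout still draws independent masks for the two copies, so the cost shifts from a second pass to a larger effective batch.

```python
import torch

def double_forward(model, inputs):
    """Run both R-Drop passes as one forward call over a duplicated batch.

    Each copy of a sample sees an independently sampled dropout mask, so the
    two halves of the output act as the two sub-models. Illustrative sketch.
    """
    doubled = torch.cat([inputs, inputs], dim=0)           # batch of size 2B
    logits1, logits2 = torch.chunk(model(doubled), 2, dim=0)
    return logits1, logits2
```

The cross-entropy and bidirectional KL terms from the earlier sketch can then be computed from logits1 and logits2; the trade-off is memory for the doubled batch rather than wall-clock time for a second pass.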

Conclusion

R-Drop is a simple but effective extension of conventional dropout, improving performance by addressing dropout-induced train-inference inconsistency. Its strong numerical results across diverse tasks underscore its generality, making it a valuable tool for improving model robustness and generalization.
