- The paper introduces R-Drop, a method that minimizes dropout-induced inconsistencies by aligning sub-model outputs using bidirectional KL divergence.
- The paper demonstrates significant performance gains across five tasks, achieving state-of-the-art BLEU scores in neural machine translation and improvements on language and vision benchmarks.
- The paper addresses practical concerns, notably the increased training cost of the double forward pass, and discusses adjustments such as dropout rates and batch sizes to offset it.
R-Drop: Regularized Dropout for Neural Networks
Dropout is a well-established technique for regularizing deep neural networks, providing an implicit ensemble method by randomly omitting hidden units during training. Despite its efficacy, this stochasticity creates an inconsistency between training, where random sub-models are sampled, and inference, where the full model runs without dropout. The paper "R-Drop: Regularized Dropout for Neural Networks" introduces R-Drop, a simple approach that reduces this inconsistency by constraining the outputs of the sub-models that dropout produces.
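To make the inconsistency concrete, here is a minimal PyTorch illustration (not from the paper): the same input passed twice through a dropout-containing network gives different outputs in training mode, while inference in eval mode is deterministic and uses no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.3), nn.Linear(8, 4))
x = torch.randn(2, 8)

net.train()
out_a = net(x)  # one randomly sampled sub-model (some units dropped)
out_b = net(x)  # a different random sub-model
print(torch.allclose(out_a, out_b))  # False: the two training outputs disagree

net.eval()
print(torch.allclose(net(x), net(x)))  # True: inference is deterministic, no dropout
```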
Methodology
R-Drop implements a training strategy that augments conventional dropout with a consistency constraint using the bidirectional Kullback-Leibler (KL) divergence to ensure sub-models' outputs align for each training sample. This method entails performing two forward passes for each sample within a mini-batch, with dropout applied independently to these passes, resulting in two different sub-model outputs. R-Drop then minimizes the divergence between these outputs, promoting consistency without altering model architecture.
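As a rough illustration, the sketch below assembles the R-Drop objective in PyTorch for a classification model: two forward passes over the same batch, cross-entropy on both outputs, and a symmetric KL term weighted by a hyperparameter (here called `alpha`, an assumed name). It is a minimal sketch under these assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """One R-Drop training step for a classifier (model must be in train mode
    so dropout samples a different mask on each forward pass)."""
    logits1 = model(x)  # first forward pass, one dropout mask
    logits2 = model(x)  # second forward pass, another dropout mask

    # Standard task loss on both sub-model outputs.
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional (symmetric) KL divergence between the two distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(logp1, logp2, log_target=True, reduction="batchmean") \
       + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")

    return ce + alpha * 0.5 * kl
```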
The analysis in the paper shows that R-Drop reduces the inconsistency that typically arises between training with dropout and inference without it, providing theoretical grounding for its ability to mitigate one of dropout's main limitations.
Experimental Results
The experimental results give strong evidence of R-Drop's effectiveness. The technique was evaluated across five core tasks spanning 18 datasets: neural machine translation (NMT), abstractive summarization, language understanding, language modeling, and image classification. In NMT, R-Drop achieved substantial performance improvements. For instance, it reached state-of-the-art results on the WMT14 English-to-German and English-to-French translation tasks, recording BLEU scores of 30.91 and 43.95, respectively, surpassing models trained with much larger datasets or more sophisticated Transformer variants.
Language understanding benchmarks showed similar trends. Fine-tuning BERT and RoBERTa with R-Drop yielded notable improvements across the GLUE benchmark, with an average gain of 1.21 points on the BERT-base model.
In abstractive summarization, applying R-Drop to BART on the CNN/DailyMail dataset raised ROUGE scores above strong contemporaries such as PEGASUS. Image classification, tested on CIFAR-100 and ImageNet, also benefited, with accuracy gains when fine-tuning pre-trained Vision Transformer (ViT) models.
Implications and Future Directions
R-Drop is a straightforward enhancement applicable to a wide range of models, including the fine-tuning stages of large pre-trained systems. Its success suggests broader applicability to any scenario where dropout's training-time randomness creates a mismatch with deterministic inference. Future work could explore applying R-Drop during pre-training, which may yield further improvements.
The main practical constraint is the higher training cost of the per-sample double forward pass; the authors discuss adjustments such as dropout rates and batch sizes that help offset this cost without sacrificing the gains.
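One practical way to realize the double forward pass, shown here as a hypothetical sketch rather than the paper's recipe, is to duplicate each sample within the batch so both dropout-perturbed outputs come from a single pass over a batch of doubled size; the per-step cost is then roughly that of training with twice the batch size.

```python
import torch

def doubled_forward(model, x):
    # Repeat the batch along dim 0: each sample appears twice, and dropout
    # applies an independent mask to each row, yielding two sub-model outputs
    # per sample from a single forward pass.
    x2 = torch.cat([x, x], dim=0)  # shape [2B, ...]
    logits = model(x2)
    logits1, logits2 = logits.chunk(2, dim=0)
    return logits1, logits2
```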
Conclusion
R-Drop is an effective upgrade to conventional dropout, improving performance by addressing the dropout-induced inconsistency between training and inference. Its consistent gains across diverse tasks underscore its generality, making it a valuable tool for improving model robustness and generalization.