Regularized Dropout (R-Drop) in Deep Learning

Updated 11 May 2026

Regularized Dropout (R-Drop) is a training method that extends standard dropout by enforcing consistency between outputs from stochastic sub-models.
It minimizes the bidirectional KL divergence between predictions under different dropout masks to align training and inference behaviors.
Empirical results show enhanced performance in tasks like translation, summarization, and image classification with improved BLEU, accuracy, and perplexity metrics.

Regularized Dropout (R-Drop) is a training paradigm that extends standard dropout by explicitly enforcing the consistency of neural network outputs across different stochastic sub-models sampled via dropout masks. The technique introduces an additional regularization term that compels the predictive distributions of multiple dropout-instantiated sub-models on the same input to align closely. This approach addresses known discrepancies between the behavior of models during training (with dropout-induced stochasticity) and inference (deterministic, without dropout), thereby improving generalization and robustness across diverse deep learning tasks (Liang et al., 2021, Zolna et al., 2017).

1. Motivation and Rationale

Dropout, as introduced by Hinton et al. (2012), regularizes deep networks by deactivating each neuron with probability $1-p$ during training. While dropout prevents co-adaptation of units and yields ensembling effects by exploring a combinatorially large set of subnetworks, an inconsistency remains: at inference, the unmodified (full) network with rescaled weights is used, diverging from the subnetworks actually exposed during training. Recent analyses demonstrate that this discrepancy can contribute to degraded performance, as the distribution of outputs at inference can differ nontrivially from those observed under dropout (Liang et al., 2021, Zolna et al., 2017).

R-Drop addresses this issue by minimizing the divergence between the predictive distributions of any two stochastic sub-models, thereby encouraging ensemble-like behavior and reducing variance across output states. The technique is closely related to earlier approaches such as Fraternal Dropout in RNNs (Zolna et al., 2017), which penalized logit variance under mask perturbations to foster mask-invariant representations.

2. Formal Definition and Loss Construction

Given parameters $\theta$, input $x$, and ground-truth label $y$, R-Drop executes two forward passes through the model under independently sampled dropout masks, yielding distributions $p_\theta(\cdot|x)$ and $q_\theta(\cdot|x)$. The composite per-sample loss consists of two components:

Task loss (Negative Log-Likelihood):

$L_{NLL}(y,x;\theta) = -\log p_\theta(y|x) - \log q_\theta(y|x)$

Bidirectional KL Regularization:

$L_{RDrop}(x;\theta) = D_{KL}(p_\theta(\cdot|x) \parallel q_\theta(\cdot|x)) + D_{KL}(q_\theta(\cdot|x) \parallel p_\theta(\cdot|x))$

The total loss minimized for each training example is

$L^i(\theta) = L_{NLL}(y_i, x_i; \theta) + \lambda \cdot L_{RDrop}(x_i; \theta)$

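where $\lambda$ is a hyperparameter controlling the regularization strength. Over the data distribution $\mathcal{D}$, the expected loss is

$L(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ L_{NLL}(y, x; \theta) + \lambda \cdot L_{RDrop}(x; \theta) \right]$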
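The per-sample loss defined above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names `kl_divergence` and `r_drop_loss`, the parameter name `lam` (standing in for $\lambda$), and the example distributions are all assumptions, and both distributions are assumed to be strictly positive, as softmax outputs are.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of positive probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def r_drop_loss(p, q, label, lam=1.0):
    """Per-sample loss: NLL from both passes plus lam times the bidirectional KL."""
    nll = -math.log(p[label]) - math.log(q[label])
    sym_kl = kl_divergence(p, q) + kl_divergence(q, p)
    return nll + lam * sym_kl, nll, sym_kl

# Hypothetical predictions for one input under two different dropout masks.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
total, nll, sym_kl = r_drop_loss(p, q, label=0, lam=1.0)
```

When the two passes agree exactly, the KL term vanishes and the loss reduces to twice the ordinary NLL, which is why the regularizer only acts on disagreement between sub-models.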

Earlier alternatives such as Fraternal Dropout instead penalized the pre-softmax logit difference, using an $\ell_2$ penalty rather than a KL divergence (Zolna et al., 2017).

3. Theoretical Properties

R-Drop can be interpreted as optimizing the task objective under the constraint that the average pairwise KL divergence across subnetworks is bounded by $\epsilon$. The Lagrangian formulation yields the combined loss above. Theoretically, for a linear softmax classifier, if the expected bidirectional KL divergence among subnetworks is bounded by $\epsilon$, the train-inference negative log-likelihood (NLL) gap is upper-bounded by a quantity determined by $\epsilon$ and a constant $C$, where $C$ depends on softmax and normalization Lipschitz constants (Liang et al., 2021).
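The constrained view can be written out explicitly. The following is a sketch of the standard Lagrangian argument rather than a reproduction of the derivation in Liang et al. (2021); the symbol $\epsilon$ for the constraint level and the notation $\mathcal{L}$ for the Lagrangian are chosen here for illustration.

$\min_\theta \; \mathbb{E}_{(x,y)} \left[ L_{NLL}(y, x; \theta) \right] \quad \text{subject to} \quad \mathbb{E}_{x} \left[ L_{RDrop}(x; \theta) \right] \le \epsilon$

$\mathcal{L}(\theta, \lambda) = \mathbb{E}_{(x,y)} \left[ L_{NLL}(y, x; \theta) \right] + \lambda \left( \mathbb{E}_{x} \left[ L_{RDrop}(x; \theta) \right] - \epsilon \right), \quad \lambda \ge 0$

For a fixed multiplier $\lambda$, the term $-\lambda \epsilon$ does not depend on $\theta$, so minimizing $\mathcal{L}(\theta, \lambda)$ over $\theta$ is equivalent to minimizing the expected value of $L_{NLL} + \lambda \cdot L_{RDrop}$, which is the combined loss.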

Penalizing KL divergence between outputs reduces the effective function space accessible to parameter updates, thereby complementing standard dropout (which constrains model structure through masking). In the Fraternal Dropout formulation, the logit-level variance regularizer is related by Jensen’s inequality to expectation-linear dropout (ELD), i.e., it is upper-bounded by the squared distance from the stochastic mask predictions to those under the “average” mask (Zolna et al., 2017).

4. Training Algorithm and Implementation

The R-Drop training step involves:

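Batch Duplication: Each training batch is duplicated so that every input $x_i$ appears twice.
Dual Forward Passes: Both copies are fed through the model with independently sampled dropout masks, yielding $p_\theta(\cdot|x_i)$ and $q_\theta(\cdot|x_i)$ for each $x_i$.
Per-sample Loss Computation:
- NLL losses: $-\log p_\theta(y_i|x_i)$ and $-\log q_\theta(y_i|x_i)$
- Bidirectional KL: $D_{KL}(p_\theta(\cdot|x_i) \parallel q_\theta(\cdot|x_i)) + D_{KL}(q_\theta(\cdot|x_i) \parallel p_\theta(\cdot|x_i))$
Total Batch Loss (for a batch of $N$ examples):

$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ L_{NLL}(y_i, x_i; \theta) + \lambda \cdot L_{RDrop}(x_i; \theta) \right]$
Backpropagation: Gradient updates are applied to $\theta$.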
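The forward-pass portion of these steps can be sketched as follows. This is a toy illustration under stated assumptions, not a reference implementation: the model is a hypothetical linear softmax classifier with dropout applied to its input features, the names `forward`, `r_drop_batch_loss`, `keep_prob`, and `lam` are invented here, the batch loss is averaged over examples, and the gradient step is omitted because it would normally be delegated to an automatic differentiation framework.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward(weights, x, keep_prob, rng):
    """Toy linear classifier; each call samples a fresh dropout mask over the inputs."""
    dropped = [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]
    logits = [sum(w * d for w, d in zip(row, dropped)) for row in weights]
    return softmax(logits)

def sym_kl(p, q):
    """D_KL(p || q) + D_KL(q || p), written as a single sum."""
    return sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def r_drop_batch_loss(weights, inputs, labels, lam, keep_prob, rng):
    n = len(inputs)
    # Batch duplication: every input appears twice.
    doubled = inputs + inputs
    # Dual forward passes: each row of the doubled batch draws its own mask.
    probs = [forward(weights, x, keep_prob, rng) for x in doubled]
    first, second = probs[:n], probs[n:]
    # Per-sample NLL and bidirectional KL, averaged into the batch loss.
    total = 0.0
    for p, q, y in zip(first, second, labels):
        nll = -math.log(p[y]) - math.log(q[y])
        total += nll + lam * sym_kl(p, q)
    return total / n

# Hypothetical two-class problem with three input features.
weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
labels = [0, 1]
loss = r_drop_batch_loss(weights, inputs, labels, lam=1.0, keep_prob=0.5,
                         rng=random.Random(0))
```

Concatenating the batch with itself, rather than calling the model twice, keeps the procedure to a single forward call while still giving each copy an independent mask.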

The regularization weight $\lambda$ is set per task, with separate values reported for neural machine translation, ViT fine-tuning, and GLUE fine-tuning. Computational cost is higher than that of standard dropout, dominated by the additional forward pass and KL gradients. Variants include sampling more than two sub-models (averaging all pairwise KLs) and applying the regularizer only every $k$ steps, though best performance is observed with two sub-models and regularization at every step (Liang et al., 2021). Memory usage roughly doubles relative to standard dropout, but batch size reduction can compensate (Zolna et al., 2017).
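The two variants mentioned above can be illustrated with a short sketch. The helper names `mean_pairwise_sym_kl` and `should_regularize` and the parameter `every_k` are hypothetical, and averaging over unordered pairs is one plausible reading of "averaging all pairwise KLs", not a detail confirmed by the source.

```python
import math
from itertools import combinations

def sym_kl(p, q):
    """Bidirectional KL between two strictly positive discrete distributions."""
    return sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def mean_pairwise_sym_kl(dists):
    """Average the bidirectional KL over every unordered pair of sub-model outputs."""
    pairs = list(combinations(dists, 2))
    return sum(sym_kl(p, q) for p, q in pairs) / len(pairs)

def should_regularize(step, every_k):
    """Apply the KL penalty only on steps that are multiples of every_k."""
    return step % every_k == 0

# Hypothetical outputs of three sub-models on the same input.
a = [0.7, 0.2, 0.1]
b = [0.6, 0.3, 0.1]
c = [0.5, 0.25, 0.25]
penalty = mean_pairwise_sym_kl([a, b, c])
```

With two sub-models the average collapses to the single bidirectional KL term of the standard formulation, and `every_k=1` recovers regularization at every step.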

5. Empirical Performance and Evaluation

R-Drop demonstrates substantial improvements across five major tasks on eighteen datasets encompassing language and vision domains:

Task/Model	Baseline	R-Drop	Gain (metric)
IWSLT NMT	32.44 BLEU	34.68 BLEU	+2.24 BLEU
WMT14 En→De	29.12 BLEU	30.91 BLEU	+1.79 BLEU
WMT14 En→Fr	42.69 BLEU	43.95 BLEU	+1.26 BLEU
CNN/DM Summ. (BART)	44.16/21.28/40.90	44.51/21.58/41.24	+0.35/+0.30/+0.34 (R1/R2/RL)
GLUE (RoBERTa-large)	88.93	89.73	+0.80 acc.
ViT-B/16 (CIFAR-100)	92.64%	93.29%	+0.65% acc.

Additional gains are observed in language modeling (WikiText-103 PPL: 26.62 → 24.94) and image classification (ImageNet ViT-L/16: 85.15% → 85.57%). Ablation studies confirm best results with two subnetworks, a suitably tuned $\lambda$, and the penalty applied at every step. R-Drop outperforms comparable strategies such as batch-doubling and Fraternal Dropout (winning by ≈2 BLEU on IWSLT) (Liang et al., 2021). Comparable experiments with Fraternal Dropout in Zolna et al. (2017) show consistent improvements in perplexity for AWD-LSTM models and in BLEU for image captioning.

6. Analysis and Practical Considerations

R-Drop is model-agnostic and straightforward to incorporate into any PyTorch or TensorFlow training loop that uses dropout; only batch duplication and KL penalty computation are required. The approach converges slightly more slowly but reduces overfitting and narrows the training–validation loss gap. R-Drop is robust to model scale, benefiting architectures from small models to those exceeding 300M parameters, and is particularly effective when fine-tuning large pre-trained models such as ViT, RoBERTa-large, and BART (Liang et al., 2021).

At inference, R-Drop has no overhead: only the standard deterministic (full) model is retained, with dropout disabled. Typical implementation tips include conservative batch size adjustment to manage memory, tuning the regularization weight $\lambda$ via validation, and maintaining identical dropout rates for the dual predictions.
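The train/inference distinction can be made concrete with a small sketch of a dropout layer. It assumes the "inverted dropout" convention common in modern frameworks, where kept activations are rescaled during training so that inference needs no rescaling; the function name `dropout` and its parameters are hypothetical.

```python
import random

def dropout(values, keep_prob, training, rng=random):
    """Inverted dropout: stochastic in training mode, the identity at inference."""
    if not training:
        # Inference: deterministic full model, no mask and no extra cost.
        return list(values)
    # Training: each unit is kept with probability keep_prob and rescaled.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
train_a = dropout(activations, keep_prob=0.5, training=True, rng=random.Random(0))
train_b = dropout(activations, keep_prob=0.5, training=True, rng=random.Random(1))
inference = dropout(activations, keep_prob=0.5, training=False)
```

The two training outputs play the role of the two sub-models that R-Drop aligns, while the inference output is the single deterministic model that is deployed.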

R-Drop, like Fraternal Dropout (Zolna et al., 2017), can be viewed as an instance of variance-reducing regularization, closely tied to mask-invariance objectives and expectation-linear dropout. The regularization force acts in the output space rather than the hidden state space and is quantitatively linked to the variance of output logits or probabilities under the dropout mask distribution. By using the symmetrized KL divergence as its penalty, the approach is distinct from, but theoretically subsumes, earlier consistency-based regularizers.

A key differentiator is its explicit enforcement of consistency at the distributional output level, rather than solely at the intermediate feature level, and its drop-in applicability across modalities, including recurrent, convolutional, and transformer architectures. Results indicate consistent improvements in both generalization error and robustness (Liang et al., 2021, Zolna et al., 2017).

References

Liang et al. (2021). R-Drop: Regularized Dropout for Neural Networks.

Zolna et al. (2017). Fraternal Dropout.