
Regularized Dropout (R-Drop) in Deep Learning

Updated 11 May 2026
  • Regularized Dropout (R-Drop) is a training method that extends standard dropout by enforcing consistency between outputs from stochastic sub-models.
  • It minimizes the bidirectional KL divergence between predictions under different dropout masks to align training and inference behaviors.
  • Empirical results show enhanced performance in tasks like translation, summarization, and image classification with improved BLEU, accuracy, and perplexity metrics.

Regularized Dropout (R-Drop) is a training paradigm that extends standard dropout by explicitly enforcing the consistency of neural network outputs across different stochastic sub-models sampled via dropout masks. The technique introduces an additional regularization term that compels the predictive distributions of multiple dropout-instantiated sub-models on the same input to align closely. This approach addresses known discrepancies between the behavior of models during training (with dropout-induced stochasticity) and inference (deterministic, without dropout), thereby improving generalization and robustness across diverse deep learning tasks (Liang et al., 2021, Zolna et al., 2017).

1. Motivation and Rationale

Dropout, as introduced by Hinton et al. (2012), regularizes deep networks by deactivating each neuron with probability $1-p$ during training. While dropout prevents co-adaptation of units and yields ensembling effects by exploring a combinatorially large set of subnetworks, an inconsistency remains: at inference, the unmodified (full) network with rescaled weights is used, diverging from the subnetworks actually exposed during training. Recent analyses demonstrate that this discrepancy can contribute to degraded performance, as the distribution of outputs at inference can differ nontrivially from those observed under dropout (Liang et al., 2021, Zolna et al., 2017).
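The train/inference mismatch described above can be made concrete on a toy model. The following is a minimal sketch, assuming a hypothetical two-class linear softmax classifier with dropout applied to its three input units; the input, the weights, and the helper names are illustrative and not taken from the cited papers. It enumerates every dropout mask to obtain the exact average prediction seen during training and compares it with the prediction of the full, rescaled network used at inference.

```python
import itertools
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, weights, mask, scale):
    """Linear softmax classifier whose input units are masked and scaled."""
    hidden = [xi * mi * scale for xi, mi in zip(x, mask)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)

keep_prob = 0.5                      # p: probability that a unit is kept
x = [1.0, -2.0, 0.5]                 # hypothetical input
weights = [[2.0, 0.0, 1.0],          # hypothetical 2-class weight matrix
           [0.0, 1.0, -1.0]]

# Training view: exact average prediction over all 2^3 dropout masks.
train_avg = [0.0, 0.0]
for mask in itertools.product([0, 1], repeat=len(x)):
    mask_prob = math.prod(keep_prob if m else 1.0 - keep_prob for m in mask)
    out = predict(x, weights, mask, scale=1.0)
    train_avg = [a + mask_prob * o for a, o in zip(train_avg, out)]

# Inference view: full network, activations rescaled by p
# (equivalent to rescaling the weights of this linear layer).
inference = predict(x, weights, mask=(1, 1, 1), scale=keep_prob)

gap = max(abs(a - b) for a, b in zip(train_avg, inference))
print(train_avg, inference, gap)
```

Because the softmax is nonlinear, the average over sub-models and the output of the rescaled full model do not coincide, even in this tiny example.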

R-Drop addresses this issue by minimizing the divergence between the predictive distributions of any two stochastic sub-models, thereby encouraging ensemble-like behavior and reducing variance across output states. The technique is closely related to earlier approaches such as Fraternal Dropout in RNNs (Zolna et al., 2017), which penalized logit variance under mask perturbations to foster mask-invariant representations.

2. Formal Definition and Loss Construction

Given parameters $\theta$, input $x$, and ground-truth label $y$, R-Drop executes two forward passes through the model under independently sampled dropout masks, yielding the distributions $p_\theta(\cdot|x)$ and $q_\theta(\cdot|x)$. The composite per-sample loss consists of two components:

  • Negative Log-Likelihood (NLL):

$$L_{NLL}(y,x;\theta) = -\log p_\theta(y|x) - \log q_\theta(y|x)$$

  • Bidirectional KL Regularization:

$$L_{RDrop}(x;\theta) = D_{KL}(p_\theta(\cdot|x) \parallel q_\theta(\cdot|x)) + D_{KL}(q_\theta(\cdot|x) \parallel p_\theta(\cdot|x))$$

The total loss minimized for each training example is

$$L^i(\theta) = L_{NLL}(y_i, x_i; \theta) + \lambda \cdot L_{RDrop}(x_i; \theta)$$

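where $\lambda$ is a hyperparameter controlling the regularization strength. Over the data distribution $\mathcal{D}$, the expected loss is

$$L(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[ L_{NLL}(y,x;\theta) + \lambda \cdot L_{RDrop}(x;\theta) \right]$$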
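The per-sample loss defined above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the two logit vectors stand in for the outputs of two dropout passes, and the names `r_drop_loss`, `logits_p`, `logits_q`, and `lam` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def r_drop_loss(logits_p, logits_q, label, lam):
    """Return (total, nll, reg) for one example.

    logits_p and logits_q are assumed to come from two forward passes
    of the same model under independently sampled dropout masks.
    """
    p = softmax(logits_p)
    q = softmax(logits_q)
    nll = -math.log(p[label]) - math.log(q[label])   # L_NLL
    reg = kl_divergence(p, q) + kl_divergence(q, p)  # L_RDrop
    return nll + lam * reg, nll, reg

total, nll, reg = r_drop_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], label=0, lam=1.0)
print(total, nll, reg)
```

When both passes produce the same logits, the regularizer vanishes and the loss reduces to twice the ordinary NLL, which is one quick sanity check for an implementation.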

Alternatives such as Fraternal Dropout previously focused on the pre-softmax logit difference, using an $\ell_2$ penalty instead of a KL divergence (Zolna et al., 2017).
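One practical difference between the two penalties can be shown with a small sketch. It assumes the logit-level penalty is a plain squared Euclidean distance between the two logit vectors, which is a simplification for illustration; the function names and numbers are hypothetical. Shifting all logits of one pass by a constant leaves the softmax distribution, and therefore the symmetric KL term, unchanged, while the logit-space distance grows.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def symmetric_kl(logits_a, logits_b):
    """KL(p || q) + KL(q || p) on the softmax outputs (R-Drop style)."""
    p, q = softmax(logits_a), softmax(logits_b)
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return forward + backward

def l2_logit_penalty(logits_a, logits_b):
    """Squared Euclidean distance between logits (assumed logit-level form)."""
    return sum((a - b) ** 2 for a, b in zip(logits_a, logits_b))

za = [2.0, 0.5, -1.0]
zb = [1.0, 0.0, -1.5]
zb_shifted = [v + 3.0 for v in zb]   # same softmax distribution as zb

print(symmetric_kl(za, zb), symmetric_kl(za, zb_shifted))
print(l2_logit_penalty(za, zb), l2_logit_penalty(za, zb_shifted))
```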

3. Theoretical Properties

R-Drop can be interpreted as optimizing the task objective under the constraint that the average pairwise KL divergence across sub-networks is bounded by a constant. The Lagrangian formulation of this constrained problem yields the combined loss above. Theoretically, for a linear softmax classifier, if the expected bidirectional KL divergence among sub-networks is bounded, the gap between training and inference negative log-likelihood (NLL) is upper-bounded by a quantity that depends on that bound and on a constant determined by the softmax and normalization Lipschitz constants (Liang et al., 2021).
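The step from the constrained problem to the combined loss can be written out as follows. This is a sketch in notation chosen here for illustration: the tolerance $\epsilon$ and the shorthand $\mathcal{L}(\theta,\lambda)$ are assumptions, not symbols taken from the cited papers.

```latex
% Constrained view: fit the task while keeping sub-model disagreement small.
\min_{\theta} \; \mathbb{E}_{(x,y)}\big[ L_{NLL}(y,x;\theta) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x}\big[ L_{RDrop}(x;\theta) \big] \le \epsilon

% Lagrangian with multiplier \lambda \ge 0:
\mathcal{L}(\theta,\lambda)
  = \mathbb{E}_{(x,y)}\big[ L_{NLL}(y,x;\theta) \big]
  + \lambda \Big( \mathbb{E}_{x}\big[ L_{RDrop}(x;\theta) \big] - \epsilon \Big)

% For a fixed \lambda the term -\lambda\epsilon does not depend on \theta, so
\arg\min_{\theta} \mathcal{L}(\theta,\lambda)
  = \arg\min_{\theta} \;
    \mathbb{E}_{(x,y)}\big[ L_{NLL}(y,x;\theta) + \lambda \, L_{RDrop}(x;\theta) \big]
```

Treating $\lambda$ as a fixed hyperparameter rather than optimizing it as a dual variable is what turns the constraint into a simple penalty term.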

Penalizing KL divergence between outputs reduces the effective function space accessible to parameter updates, thereby complementing standard dropout (which constrains model structure through masking). In the Fraternal Dropout formulation, the logit-level variance regularizer is related by Jensen’s inequality to expectation-linear dropout (ELD), i.e., it is upper-bounded by the squared distance from the stochastic mask predictions to those under the “average” mask (Zolna et al., 2017).
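A bound of this type can be seen with an elementary argument. The sketch below is not necessarily the derivation or the constant used by Zolna et al.; it assumes $u_i$ and $u_j$ are the logits under two independent masks, $\bar{u} = \mathbb{E}[u]$ is their mean, and $c$ is any fixed reference vector, for example the prediction under the "average" mask.

```latex
% Pairwise distance equals twice the variance (cross term vanishes by independence):
\mathbb{E}\,\|u_i - u_j\|^2
  = \mathbb{E}\,\|u_i - \bar{u}\|^2 + \mathbb{E}\,\|u_j - \bar{u}\|^2
    - 2\,\mathbb{E}\,\langle u_i - \bar{u},\, u_j - \bar{u} \rangle
  = 2\,\mathbb{E}\,\|u - \bar{u}\|^2

% The mean minimizes expected squared distance, for any fixed c:
\mathbb{E}\,\|u - c\|^2
  = \mathbb{E}\,\|u - \bar{u}\|^2 + \|\bar{u} - c\|^2
  \;\ge\; \mathbb{E}\,\|u - \bar{u}\|^2

% Combining the two lines:
\mathbb{E}\,\|u_i - u_j\|^2 \;\le\; 2\,\mathbb{E}\,\|u - c\|^2
```

The pairwise penalty therefore never exceeds a constant multiple of the distance to a single reference prediction, while avoiding the need to compute that reference explicitly.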

4. Training Algorithm and Implementation

The R-Drop training step involves:

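  1. Batch Duplication: Each training batch is duplicated so that every input $x_i$ appears twice.
  2. Dual Forward Passes: Both copies are fed through the model with independently sampled dropout masks, yielding $p_\theta(\cdot|x_i)$ and $q_\theta(\cdot|x_i)$ for each $x_i$.
  3. Per-sample Loss Computation:
    • NLL losses: $-\log p_\theta(y_i|x_i)$ and $-\log q_\theta(y_i|x_i)$
    • Bidirectional KL: $D_{KL}(p_\theta(\cdot|x_i) \parallel q_\theta(\cdot|x_i)) + D_{KL}(q_\theta(\cdot|x_i) \parallel p_\theta(\cdot|x_i))$
  4. Total Batch Loss (averaged over the $n$ examples in the batch):

    $$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ L_{NLL}(y_i, x_i; \theta) + \lambda \cdot L_{RDrop}(x_i; \theta) \right]$$

  5. Backpropagation: Gradient updates are applied to $\theta$.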
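Steps 1 to 4 above can be sketched in plain Python on a toy model. This is an illustration under several assumptions: the model is a hypothetical linear softmax classifier with inverted dropout on its inputs, the batch loss is averaged over examples, and all names (`r_drop_batch_loss`, `keep_prob`, `lam`) are made up for this example. Step 5 is omitted because a real training loop would delegate gradients to an autodiff framework.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dropout(x, keep_prob, rng):
    """Inverted dropout: kept units are scaled by 1 / keep_prob."""
    return [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]

def model(x, weights, keep_prob, rng):
    """Toy linear softmax classifier; every call samples a fresh mask."""
    hidden = dropout(x, keep_prob, rng)
    return softmax([sum(w * h for w, h in zip(row, hidden)) for row in weights])

def r_drop_batch_loss(batch, weights, lam, keep_prob, rng):
    n = len(batch)
    doubled = batch + batch                                   # step 1
    outs = [model(x, weights, keep_prob, rng) for x, _ in doubled]  # step 2
    total = 0.0
    for i, (_, label) in enumerate(batch):                    # step 3
        p, q = outs[i], outs[i + n]
        nll = -math.log(p[label]) - math.log(q[label])
        reg = kl_divergence(p, q) + kl_divergence(q, p)
        total += nll + lam * reg
    return total / n                                          # step 4

# Hypothetical data: two examples with three features, two classes.
batch = [([1.0, -2.0, 0.5], 0), ([0.3, 0.8, -1.2], 1)]
weights = [[2.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
print(r_drop_batch_loss(batch, weights, lam=1.0, keep_prob=0.8, rng=random.Random(0)))
```

Concatenating the batch with itself, rather than calling the model twice per example, mirrors the batch-duplication step and lets both passes share one forward call in a vectorized framework.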

The regularization weight $\lambda$ is chosen per task, with separate settings reported for neural machine translation, ViT fine-tuning, and GLUE fine-tuning. Each training step costs more than with standard dropout, with the overhead dominated by the additional forward pass and the KL gradients. Variants include sampling more than two sub-models (averaging all pairwise KLs) and applying the regularizer only every $k$ steps, though the best performance is observed with two sub-models and the penalty applied at every step (Liang et al., 2021). Memory usage roughly doubles relative to standard dropout, but reducing the batch size can compensate (Zolna et al., 2017).
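The two variants mentioned above can be sketched as small helpers. This is a hedged illustration: the function names, the choice to average over unordered pairs with both KL directions, and the step-counting convention are assumptions made for this example rather than details confirmed by the source.

```python
import itertools
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def avg_pairwise_sym_kl(dists):
    """Average bidirectional KL over all unordered pairs of sub-model outputs."""
    pairs = list(itertools.combinations(dists, 2))
    return sum(kl_divergence(p, q) + kl_divergence(q, p) for p, q in pairs) / len(pairs)

def apply_penalty(step, k):
    """True on the steps where the regularizer is added (every k-th step)."""
    return step % k == 0

# Hypothetical output distributions from three dropout passes on one input.
outputs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.65, 0.2, 0.15]]
print(avg_pairwise_sym_kl(outputs))
print([step for step in range(6) if apply_penalty(step, k=2)])
```

With exactly two distributions the helper reduces to the standard bidirectional KL term, and `k=1` recovers the default of penalizing every step.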

5. Empirical Performance and Evaluation

R-Drop demonstrates substantial improvements across five major tasks on eighteen datasets encompassing language and vision domains:

| Task/Model | Baseline | R-Drop-enhanced | Metric/Delta |
|---|---|---|---|
| IWSLT NMT | 32.44 BLEU | 34.68 BLEU | +2.24 BLEU |
| WMT14 En→De | 29.12 BLEU | 30.91 BLEU | +1.79 BLEU |
| WMT14 En→Fr | 42.69 BLEU | 43.95 BLEU | +1.26 BLEU |
| CNN/DM Summ. (BART) | 44.16/21.28/40.90 | 44.51/21.58/41.24 | +0.35/+0.30/+0.34 (R1/R2/RL) |
| GLUE (RoBERTa-large) | 88.93 | 89.73 | +0.80 acc. |
| ViT-B/16 (CIFAR-100) | 92.64% | 93.29% | +0.65% acc. |

Additional gains are observed in language modeling (WikiText-103 perplexity: 26.62 → 24.94) and image classification (ImageNet, ViT-L/16: 85.15% → 85.57%). Ablation studies confirm the best results for two sub-networks with the penalty applied at every step. R-Drop outperforms comparable strategies such as batch-doubling and Fraternal Dropout (by ≈2 BLEU on IWSLT) (Liang et al., 2021). Experiments with Fraternal Dropout in Zolna et al. (2017) likewise show consistent improvements in perplexity for AWD-LSTM models and in BLEU for image captioning.

6. Analysis and Practical Considerations

R-Drop is model-agnostic and straightforward to incorporate into any PyTorch or TensorFlow training loop that uses dropout; only batch duplication and the KL penalty computation are required. The approach converges slightly more slowly but reduces overfitting and narrows the gap between training and validation loss. R-Drop is robust to model scale, benefiting architectures from small models to those exceeding 300M parameters, and is particularly effective when fine-tuning large pre-trained models such as ViT, RoBERTa-large, and BART (Liang et al., 2021).

At inference, R-Drop has no overhead: only the standard deterministic (full) model is retained, with dropout disabled. Typical implementation tips include reducing the batch size conservatively to manage memory, tuning the regularization hyperparameters (such as $\lambda$) on validation data, and keeping the dropout rate identical for the two forward passes.

R-Drop, like Fraternal Dropout (Zolna et al., 2017), can be viewed as an instance of variance-reducing regularization, closely tied to mask-invariance objectives and expectation-linear dropout. The regularization acts in the output space rather than the hidden-state space and is quantitatively linked to the variance of the output logits or probabilities under the dropout mask distribution. The approach is distinct from, but theoretically subsumes, earlier consistency-based regularizers by using the symmetrized KL divergence as its penalty.

A key differentiator is its explicit enforcement of consistency at the distributional output level, rather than solely at the intermediate feature level, and its drop-in applicability across modalities, including recurrent, convolutional, and transformer architectures. Results indicate consistent improvements in both generalization error and robustness (Liang et al., 2021, Zolna et al., 2017).
