
Weight-Dropped LSTM

Updated 16 March 2026
  • Weight-dropped LSTM is a recurrent network that applies DropConnect to hidden-to-hidden matrices to reduce overfitting in sequence models.
  • The method combines stochastic recurrent weight masking with the conventional LSTM recurrence; together with the rest of the AWD-LSTM recipe, it reached state-of-the-art perplexity on PTB and WikiText-2 when published in 2017.
  • Empirical results demonstrate that weight-dropping improves long-range temporal stability and overall generalization without disrupting optimized LSTM kernels.

A weight-dropped LSTM is a Long Short-Term Memory (LSTM) network in which the recurrent (hidden-to-hidden) weight matrices are regularized using DropConnect. This technique, introduced in "Regularizing and Optimizing LSTM Language Models" (Merity et al., 2017), aims to improve generalization in sequence modeling, particularly word-level language modeling, by stochastically masking elements of the recurrent matrices during training. The approach targets overfitting and long-range temporal stability without interfering with black-box, highly optimized LSTM kernels such as those in cuDNN.

1. Standard LSTM Recurrence Relations

The conventional LSTM processes sequential data via the following recurrence:

$$
\begin{aligned}
i_t &= \sigma(W^i x_t + U^i h_{t-1} + b^i) \\
f_t &= \sigma(W^f x_t + U^f h_{t-1} + b^f) \\
o_t &= \sigma(W^o x_t + U^o h_{t-1} + b^o) \\
\tilde{c}_t &= \tanh(W^c x_t + U^c h_{t-1} + b^c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here, $x_t \in \mathbb{R}^d$ is the input at time $t$, $h_t, c_t \in \mathbb{R}^H$ are the hidden and cell states, $W^\cdot$ and $U^\cdot$ are the input-to-hidden and hidden-to-hidden weight matrices, $b^\cdot$ are the biases, $\sigma$ is the sigmoid nonlinearity, and $\odot$ is the elementwise product.
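The recurrence above can be sketched in NumPy as a single time step. This is a minimal illustration rather than an optimized kernel: the function names `lstm_step` and `init_params`, the dict-of-gates layout, and the Gaussian initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(d, H, rng):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    gates = ("i", "f", "o", "c")
    W = {g: rng.normal(0.0, 0.1, size=(H, d)) for g in gates}
    U = {g: rng.normal(0.0, 0.1, size=(H, H)) for g in gates}
    b = {g: np.zeros(H) for g in gates}
    return W, U, b

# Usage: one step from zero state with d = 3 inputs and H = 4 hidden units.
rng = np.random.default_rng(0)
W, U, b = init_params(3, 4, rng)
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
```

Since $h_t$ is a product of a sigmoid and a tanh, every component of `h` lies strictly between -1 and 1.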

2. DropConnect-Based Recurrent Regularization

In a weight-dropped LSTM, DropConnect is applied to each hidden-to-hidden matrix $U^\cdot$. For each $U \in \{U^i, U^f, U^o, U^c\}$, a binary mask $M$ of the same shape is sampled:

$$M_{jk} \sim \mathrm{Bernoulli}(1 - p_{\mathrm{rec}})$$

yielding a masked recurrent matrix

$$\widetilde{U} = U \odot M$$

Throughout a given batch, $\widetilde{U}$ remains fixed across all time steps and for both the forward and backward passes. The recurrence is rewritten as

$$i_t = \sigma\bigl(W^i x_t + (U^i \odot M^i)\, h_{t-1} + b^i\bigr)$$

with analogous masking for $U^f, U^o, U^c$. Because a new mask is drawn for each batch, different recurrent weights are dropped from batch to batch, which discourages parameter co-adaptation over time steps. No changes are needed to cuDNN kernels aside from substituting $U$ with $\widetilde{U}$.
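The masking step can be sketched as follows. The helper names `sample_masked_recurrent` and `run_sequence` are hypothetical, and the plain NumPy loop stands in for an optimized kernel. The sketch follows the formula above literally, with no rescaling of the surviving weights; some implementations that reuse a standard dropout routine also divide by $1 - p_{\mathrm{rec}}$, which is not shown here.

```python
import numpy as np

def sample_masked_recurrent(U, p_rec, rng):
    """Return {gate: U[gate] * M[gate]} with M_jk ~ Bernoulli(1 - p_rec)."""
    masked = {}
    for gate, mat in U.items():
        mask = rng.random(mat.shape) >= p_rec  # keep each weight with prob. 1 - p_rec
        masked[gate] = mat * mask
    return masked

def run_sequence(xs, W, U_tilde, b, H):
    """Run an LSTM over xs, reusing the same masked matrices at every step."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c = np.zeros(H), np.zeros(H)
    for x_t in xs:
        pre = {g: W[g] @ x_t + U_tilde[g] @ h + b[g] for g in "ifoc"}
        c = sig(pre["f"]) * c + sig(pre["i"]) * np.tanh(pre["c"])
        h = sig(pre["o"]) * np.tanh(c)
    return h

# Usage: masks are drawn once per batch, before the loop over time steps.
rng = np.random.default_rng(0)
d, H, T = 3, 4, 5
W = {g: rng.normal(0.0, 0.1, (H, d)) for g in "ifoc"}
U = {g: rng.normal(0.0, 0.1, (H, H)) for g in "ifoc"}
b = {g: np.zeros(H) for g in "ifoc"}
U_tilde = sample_masked_recurrent(U, p_rec=0.5, rng=rng)
h_T = run_sequence(rng.normal(size=(T, d)), W, U_tilde, b, H)
```

Sampling outside the time loop is the point of the design: the recurrent computation itself is untouched, so the masked matrices can be handed to any LSTM implementation that accepts weights as input.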

3. Model Hyperparameters and Regularization

The weight-dropped LSTM is typically integrated into a broader regularization and optimization pipeline. The key hyperparameters, as empirically tuned on Penn Treebank (PTB) and WikiText-2 (WT2), are:

| Parameter | PTB | WT2 |
|---|---|---|
| Layers (stacked LSTM) | 3 | 3 |
| Hidden size $H$ | 1150 | 1150 |
| Embedding size $d$ | 400 | 400 |
| Input dropout (on $x_t$) | 0.4 | 0.65 |
| Inter-layer dropout | 0.3 | 0.3 |
| Output dropout | 0.4 | 0.4 |
| Embedding dropout $p_e$ | 0.1 | 0.1 |
| Recurrent DropConnect $p_{\mathrm{rec}}$ | 0.5 | 0.5 |
| Activation regularization (AR) $\alpha$ | 2 | 2 |
| Temporal activation regularization (TAR) $\beta$ | 1 | 1 |

Variational dropout masks are used per-batch for input, inter-layer, and output connections. Embedding dropout is used on token embeddings. Weight tying is employed between input embedding and output softmax layers.
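As a rough illustration of what distinguishes variational dropout from ordinary dropout, the sketch below samples one mask per sequence in the batch and reuses it at every time step. The function name `variational_dropout`, the `(T, B, F)` tensor layout, and the inverted-dropout rescaling by `1 / (1 - p)` are assumptions made for the example.

```python
import numpy as np

def variational_dropout(x, p, rng):
    """Apply one dropout mask, shared across time, to x of shape (T, B, F)."""
    if p == 0.0:
        return x
    keep = 1.0 - p
    # Mask has shape (1, B, F), so broadcasting repeats it over all T steps.
    mask = (rng.random((1,) + x.shape[1:]) < keep) / keep
    return x * mask

# Usage: input dropout of 0.4 on a batch of 2 sequences, 6 steps, 3 features.
rng = np.random.default_rng(0)
x = np.ones((6, 2, 3))
y = variational_dropout(x, 0.4, rng)
```

Every time slice `y[t]` is dropped in the same positions, whereas standard dropout would draw a fresh mask at each step.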

Sequence-length jitter is introduced for BPTT: the length is drawn as $L \sim \mathcal{N}(70, 5)$ with 95% probability and as $L \sim \mathcal{N}(35, 5)$ otherwise, where 5 is the standard deviation, and the learning rate is scaled by $L/70$.
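The sampling rule can be sketched with the standard library alone. The function names, the integer truncation, and the lower bound `min_len` (which guards against very short or negative draws) are assumptions for the example, not details stated above.

```python
import random

def sample_bptt_length(base=70, std=5, p_full=0.95, min_len=5, rng=random):
    """Draw a BPTT length: mean `base` with prob. p_full, else mean base / 2."""
    mean = base if rng.random() < p_full else base / 2
    return max(min_len, int(rng.gauss(mean, std)))

def scaled_lr(lr, seq_len, base=70):
    """Rescale the learning rate in proportion to the sampled length."""
    return lr * seq_len / base

# Usage with a seeded generator for reproducibility.
rng = random.Random(0)
lengths = [sample_bptt_length(rng=rng) for _ in range(1000)]
half_length_lr = scaled_lr(30, 35)  # a length-35 batch gets half the base rate
```

Scaling the step size by $L/70$ keeps short sequences from receiving a disproportionately large update per token.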

4. Optimization: Non-Monotonic Triggered ASGD

Optimization is performed by NT-ASGD, a variant of Averaged Stochastic Gradient Descent (ASGD) in which the averaging trigger is determined by a non-monotonic validation error criterion. Let $w_k$ denote the parameters at step $k$, updated as

$$w_{k+1} = w_k - \gamma \, \widehat{\nabla} f(w_k)$$

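with fixed $\gamma = 30$. Every epoch, the validation perplexity $v_t$ is logged. With a non-monotone interval of $n = 5$, averaging commences at the first check $t > n$ for which $v_t > \min_{0 \le l \le t-n-1} v_l$, that is, when the current perplexity fails to beat the best value recorded more than $n$ checks earlier; the trigger is then set to $T = k$. The final parameters are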

$$w_{\mathrm{avg}} = \frac{1}{K - T + 1} \sum_{i=T}^{K} w_i$$

This adaptive mechanism removes the need for manual selection of $T$ as in classical ASGD.
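A minimal sketch of the two pieces, the trigger check and the running average, is given below. The names `should_trigger` and `RunningAverage` are hypothetical. The trigger compares the newest validation value with the best value logged more than `n` checks ago, which is one common reading of the non-monotone criterion; treat the exact window as an assumption.

```python
def should_trigger(val_log, v, n=5):
    """Non-monotone check: `v` is the newest validation perplexity,
    `val_log` holds the earlier ones in chronological order."""
    return len(val_log) > n and v > min(val_log[:-n])

class RunningAverage:
    """Incremental mean of parameter vectors w_T, ..., w_K."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, w):
        self.count += 1
        if self.mean is None:
            self.mean = [float(x) for x in w]
        else:
            # mean_new = mean_old + (w - mean_old) / count
            self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, w)]

# Usage: start averaging once the trigger fires, then update after each step.
avg = RunningAverage()
avg.update([1.0, 2.0])
avg.update([3.0, 4.0])
```

The incremental update produces the same result as the closed-form average over $w_T, \ldots, w_K$ without storing every iterate.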

5. Experimental Protocol

Experiments are conducted on Penn Treebank and WikiText-2, with vocabularies of 10k (PTB) and ~33k (WT2) words. Training uses a batch size of 40 (PTB) or 80 (WT2), NT-ASGD for 750 epochs, and gradient clipping at a maximum norm of 0.25. After NT-ASGD, the model is fine-tuned with ASGD started at $T = 0$ from the NT-ASGD solution, and the fine-tuning run is terminated using the same non-monotonic criterion. Computation is executed on NVIDIA GPUs using the cuDNN LSTM for efficiency.
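Clipping by maximum norm rescales the whole gradient when its global L2 norm exceeds the threshold. The sketch below assumes gradients are held as a list of plain Python lists; the name `clip_grad_norm` and that layout are illustrative only.

```python
import math

def clip_grad_norm(grads, max_norm=0.25):
    """Return (clipped_grads, original_norm) for a list of gradient vectors."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / total
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total

# Usage: a gradient of norm 5 is scaled down to norm 0.25.
clipped, norm_before = clip_grad_norm([[3.0, 4.0]])
```

Gradients whose norm is already below the threshold are returned unchanged, so clipping alters only the magnitude of large updates and never their direction.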

6. Empirical Results

Single-model perplexities for AWD-LSTM (ASGD Weight-Dropped LSTM) are:

| Dataset | Model | #Params | Val PPL | Test PPL |
|---|---|---|---|---|
| PTB | AWD-LSTM (3×1150, drop-drop) | 24M | 60.0 | 57.3 |
| PTB | AWD-LSTM + neural cache | 24M | 53.9 | 52.8 |
| WT2 | AWD-LSTM (3×1150, increased inp drop) | 33M | 68.6 | 65.8 |
| WT2 | AWD-LSTM + neural cache | 33M | 53.8 | 52.0 |

These perplexities, state of the art when reported, come from the full AWD-LSTM recipe; the ablations isolate how much of the gain is due to recurrent DropConnect. Removing the weight-dropped recurrence degraded perplexity by ~11 points (PTB) and ~9 points (WT2), the largest drop of any single component, while omitting embedding or variational dropout increased perplexity by 2–6 points. This indicates that weight-dropped regularization is critical for generalization.

7. Analysis and Implications

Weight-dropping regularizes the recurrent transition dynamics at the parameter level, analogously to variational dropout on hidden activations. By masking a subset of the weights in each $U$ per batch, it prevents overfitting to specific recurrent pathways and mitigates co-adaptation. The method is fully compatible with optimized LSTM kernels, since the masked weights are computed once per batch and require only a single additional elementwise multiply per gate.

The approach acts synergistically with other regularization techniques (embedding dropout, AR/TAR penalties, weight tying), substantially improving long-range stability and generalization, as evidenced by ablation studies and held-out perplexity metrics (Merity et al., 2017). This suggests weight-dropped LSTM variants are well-suited for language modeling applications demanding robustness to overfitting, especially in low-resource or small-vocabulary settings.

